1. Introduction
Forest ecosystems are being reshaped by interacting biotic and abiotic disturbances, and the frequency, spatial extent, and ecological severity of tree mortality events are increasing under warming climates, rising vapor pressure deficit, drought intensification, and simplified stand structure [1, 2, 3]. In pine-dominated systems, mortality can be driven by pine wilt disease, bark- and wood-boring insects, defoliators, drought, fire, or compound disturbance cascades, yet these processes often converge on a recognizable crown-level syndrome characterized by chlorosis, crown thinning, progressive dehydration, and eventual needle reddening or greying [4, 5, 6, 7, 8]. From a surveillance perspective, this convergence is important: management commonly needs to locate newly symptomatic trees rapidly, even before final attribution to a causal agent is completed in the field.
Rapid detection of wilted, severely damaged, or recently withered pines is operationally valuable for several reasons. First, symptomatic trees may act as local inoculum sources, vector-breeding substrates, or indicators of an actively expanding mortality focus, so delayed recognition can increase the cost and difficulty of sanitation, quarantine, and follow-up investigation [9, 10, 11, 12]. This is especially important for trees that have newly entered the withered stage, because they can provide breeding substrates for secondary insects, especially longhorn beetles and bark beetles, and may therefore require prompt sanitation before secondary populations increase. Second, early mapping of newly affected crowns improves prioritization of field verification in landscapes where access is constrained by steep terrain, fragmented roads, and heterogeneous canopy conditions. Third, explicit detection of withered crowns supports post-disaster pest assessment and broader forest-health assessment because the abundance and spatial arrangement of needle-free dead trees provide a directly interpretable indicator of damage severity and recovery priority. Fourth, dry standing dead pines represent a potential fire-hazard component, so locating them can also support fuel-risk screening and removal planning. Symptom mapping can therefore support integrated monitoring systems in which remote sensing, ground diagnosis, hazard reduction, and management response are linked in a time-sensitive workflow rather than treated as independent tasks.
Unmanned aerial vehicles (UAVs) have become a central tool for this type of surveillance because they bridge the gap between labor-intensive field surveys and coarser satellite monitoring [13, 14, 15]. UAV platforms can be deployed on demand, can acquire centimetre-scale imagery over difficult terrain, and can produce observations suitable for crown-level analysis. Recent work has shown that UAV-based spectral and image-texture information can support early or intermediate-stage detection of pine decline, particularly when deep learning or advanced machine learning is used to exploit high-resolution crown patterns [16, 17, 18, 19, 20, 21, 22, 23]. Broader reviews now emphasize that UAV systems are especially useful where the management objective requires flexible, repeated observations over limited to medium spatial extents and where the target signal is expressed at the individual-tree level [23, 24].
Despite these advantages, most operational workflows still depend predominantly on nadir-view orthomosaics. This convention is understandable because nadir imagery is easy to interpret, integrates naturally with mapping workflows, and supports direct generation of orthophotos, canopy height products, or structure-from-motion outputs. However, top-down products also compress three-dimensional crowns into a single viewing geometry. As a result, diagnostically important features such as lateral needle droop, crown-edge collapse, asymmetric thinning, or partially occluded discoloration may be weakened or entirely missed, especially in mountainous forests with strong relief, self-shadowing, and variable background composition [19, 25, 26]. A nadir orthomosaic can therefore be geometrically convenient while still being suboptimal as a detector input.
This issue has become more important as algorithmic performance has advanced rapidly. Pine wilt detection studies increasingly report strong results from CNNs, semantic segmentation frameworks, and improved YOLO-based detectors trained on UAV imagery [17, 18, 19, 20, 21, 22, 27, 28]. More general reviews of artificial intelligence in forest health surveillance likewise show that disease and pest detection has become one of the dominant application domains for computer vision in forestry [28]. Yet many published gains are difficult to interpret mechanistically because detector architecture, input sensor type, site conditions, and labeling protocols often change simultaneously. Consequently, the specific contribution of image-view geometry remains insufficiently resolved. Put differently, it is often unclear whether better results arise because the model is stronger, because the sensor is richer, or because the acquisition geometry simply exposes crown symptoms more clearly in the first place.
Oblique UAV imaging is a plausible way to address this limitation. By capturing lateral crown surfaces and additional structural context, oblique photographs can reduce the effective loss of information caused by strict top-down observation. Oblique image networks also align naturally with direct-photo workflows and photogrammetric localization strategies, which may shorten the interval between image acquisition and field response. Previous studies have shown that oblique photogrammetry can improve retrieval of canopy structure proxies such as leaf area index and can enrich structure-from-motion reconstruction by adding complementary viewing angles [25, 26]. In the pine wilt domain, the practical value of richer viewing geometry is also reinforced by recent work on unattended UAV hyperspectral systems, multispectral object detection, and explainable or lightweight models, all of which underscore the importance of preserving diagnostically relevant cues while remaining operationally efficient [21, 27, 29, 30].
Against this background, the present study evaluates viewing geometry as the central operational factor while keeping the study area, survey period, UAV platform, symptom categories based on the shoot damage ratio (SDR), and training workflow fixed. Specifically, we compare a nadir dataset (D1), an oblique dataset (D2), and a simple mixed-view concatenation dataset (D3) for crown-level detection of wilt-affected pines using YOLO11. We emphasize that D3 is a baseline for naive pooling and not a dedicated multi-view fusion architecture. We also analyze an exploratory paired crown-level RGB subset so that interpretation of detector behavior is anchored in directly observed appearance differences while remaining appropriately cautious about sample size and radiometric effects.
Figure 1 summarizes the conceptual contrast between nadir orthophotography and oblique image-based monitoring. Our objectives were to: (i) quantify whether oblique RGB imagery improves rapid crown detection over nadir imagery in the same campaign; (ii) assess whether any advantage is associated with measurable, but exploratory, shifts in crown-level RGB profile; and (iii) evaluate the operational relevance of direct inference on geotagged UAV photographs for rapid forest health surveillance in rugged pine landscapes. We do not attempt causal attribution of wilt symptoms to a single agent, nor do we use the paired RGB subset as proof of a physiological mechanism; instead, it is used as a secondary explanatory complement to the detector comparison.
2. Materials and Methods
2.1. Study Site and UAV Data Acquisition
The field campaign was conducted in April 2025 in pine forests distributed across Malipo, Xichou, and Maguan counties, Wenshan Prefecture, Yunnan Province, China. The study region is characterized by mountainous terrain, locally steep slopes, fragmented land cover, and variable crown exposure conditions, all of which create a demanding visual environment for remote detection of individual symptomatic trees. These characteristics make the area an appropriate testbed for evaluating whether image perspective meaningfully alters crown detectability under operational conditions. The study-area context and data-acquisition workflow are summarized in Figure 2.
Images were acquired using a DJI Mavic 3E (RTK edition) equipped with a 4/3 CMOS RGB camera with a 24 mm equivalent focal length (Figure 2a). Terrain-following flight was implemented at an altitude of 150 m, corresponding to an approximate ground sample distance of 4.03 cm. Nadir and oblique images were collected during the same campaign so that viewing geometry could be compared without confounding by season, stand condition, or sensor platform. The oblique dataset was acquired at tilt angles of approximately 45–70°, providing lateral views of crown form in addition to top-visible crown surfaces.
During the same survey period, ground teams geotagged 3217 pine trees and quantified crown damage using the shoot damage ratio (SDR), a widely used metric for crown injury defined in the Standard of Forest Pest Occurrence and Disaster (LY/T 1681-2006) as the proportion of damaged shoots to the total number of shoots in an individual tree crown based on detailed shoot counts [31]. For object detection, the field ratings were consolidated into three operational classes: early-stage (SDR < 50%), severely damaged (SDR ≥ 50%), and withered (trees with no needles). This consolidation was used only to align the detector with operational triage needs; it does not revise the official five-level standard or imply definitive causal diagnosis. Potential decline drivers, including biotic agents, drought, fire, and wind damage, were also recorded to support interpretation of the image-analysis results and crown-level labels in Figure 2.
2.2. Dataset Construction
Two image products were constructed from matched UAV acquisitions over the same stands (Figure 2c). Dataset D1 consisted of nadir-view orthophotography, whereas Dataset D2 consisted of oblique RGB photographs acquired at 45–70° tilt angles. A third dataset, D3, was created by merging D1 and D2 to test whether simple multi-view concatenation improved performance relative to single-view training. The central experimental logic was therefore not to compare different platforms or different sites, but to compare how the same biological target was represented under different viewing geometries.
All datasets were organized for the same three object-detection classes derived from the SDR-based field ratings: early-stage (SDR < 50%), severely damaged (SDR ≥ 50%; abbreviated as ‘severe’ in model outputs), and withered (needle-free crowns). These classes were defined as operational bins along a continuous damage and vigor gradient. Borderline crowns were assigned according to the dominant visible condition, the SDR threshold, and quality-control review. Bounding boxes were assigned at the crown level by trained image analysts using field geotags, SDR-based field records, and visible crown symptoms as anchors. Each annotated image was checked in a second quality-control pass, and ambiguous boxes or severity labels were resolved by consensus with the field records. The validation and test subsets each contained 50 crowns per class, and the training subsets were assembled to preserve an approximately balanced class composition. To reduce leakage from spatial autocorrelation, data partitioning was conducted at the image-block/flight-segment level rather than by randomly splitting individual bounding boxes. Crowns from the same original orthophoto tile, oblique image sequence, or immediately adjacent flight segment were kept in the same subset where possible. Because the imagery came from a single regional campaign with overlapping UAV coverage, residual spatial autocorrelation cannot be excluded; therefore, the reported metrics are interpreted as within-campaign validation rather than external-site transfer performance.
This dataset design is important for interpretation. If D2 outperformed D1, the advantage could not be attributed to a different site, time period, platform, or symptom taxonomy. However, the comparison should be understood as an operational contrast between nadir and oblique acquisition configurations: oblique imagery changes not only the nominal camera angle but also visible crown sides, background composition, illumination geometry, and self-shadowing. Conversely, if D3 failed to outperform D2, the result would not demonstrate that multiple views are redundant. It would show only that simple sample-level concatenation, without view labels, view-aware embeddings, or explicit cross-view fusion, was insufficient under the present training design.
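The block-level partitioning described in this section can be illustrated with a minimal sketch. It assumes each annotated crown carries an identifier for its orthophoto tile or flight segment; the names `crown_id` and `block_id`, the split fractions, and the seed are hypothetical, and the sketch does not enforce the 50-crowns-per-class quota used for the validation and test subsets.

```python
import random
from collections import defaultdict


def split_by_block(crowns, val_frac=0.15, test_frac=0.15, seed=0):
    """Assign whole image blocks (not individual boxes) to train/val/test.

    `crowns` is a list of (crown_id, block_id) pairs. Both field names and
    the default fractions are illustrative assumptions.
    """
    blocks = defaultdict(list)
    for crown_id, block_id in crowns:
        blocks[block_id].append(crown_id)

    # Shuffle block identifiers reproducibly, then cut the list into subsets.
    block_ids = sorted(blocks)
    random.Random(seed).shuffle(block_ids)
    n_val = max(1, round(len(block_ids) * val_frac))
    n_test = max(1, round(len(block_ids) * test_frac))

    assignment = {}
    for i, block_id in enumerate(block_ids):
        if i < n_val:
            assignment[block_id] = "val"
        elif i < n_val + n_test:
            assignment[block_id] = "test"
        else:
            assignment[block_id] = "train"

    # Every crown inherits the subset of its block, so no block is divided.
    return {crown_id: assignment[block_id] for crown_id, block_id in crowns}
```

Because the subset is decided per block, crowns that share a tile or flight segment can never straddle the training and evaluation sets, which is the leakage pathway the partitioning is meant to limit.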
2.3. YOLO11 Model Development and Evaluation
We adopted the Ultralytics implementation of YOLO11 as a one-stage detector for crown localization and class assignment [29]. YOLO-family detectors are widely used because they combine end-to-end detection with comparatively efficient training and inference, thereby making them attractive for operational workflows and edge-oriented deployment. For clarity, the network diagram in Figure 2d follows the standard YOLO11 notation: C3K2/C3k2 denotes compact CSP-style feature-extraction blocks, whereas C2PSA denotes a C2-style position-sensitive attention module used to strengthen spatially informative features before detection. These modules were used as implemented in YOLO11; no architecture-level innovation was introduced in this study. For context, YOLO11 belongs to a lineage that builds on advances from earlier object detectors including R-CNN, Faster R-CNN, SSD, and the original YOLO framework [32, 33, 34, 35]. In this study, the model was not treated as the object of algorithmic innovation; rather, it served as a strong contemporary detector with which to test the effect of image perspective on downstream performance.
Original RGB images (5280 × 3956 pixels) were resized to 640 × 640 pixels for training efficiency and to maintain a consistent YOLO input format. This resolution was sufficient for crown-level bounding boxes but may attenuate fine needle-level symptoms; therefore, we interpret the task as crown-scale detection rather than diagnosis of sub-crown physiological detail. Three models were trained separately on D1, D2, and D3 using stochastic gradient descent for 500 epochs with an initial learning rate of 0.01, momentum of 0.937, and weight decay of 0.0005. Non-maximum suppression was used at inference to produce final detections. The same training configuration was retained across datasets so that differences in performance could be attributed as directly as possible to differences in the input data rather than to altered optimization settings. No view-specific hyperparameter tuning or post hoc parameter adjustment was introduced after preliminary comparisons, because doing so would have weakened the fairness of the view-geometry experiment. Inference in this manuscript was performed on post-flight geotagged UAV photographs. We did not conduct an onboard, in-flight latency or frames-per-second benchmark, so the evaluation reports detection accuracy and workflow readiness for rapid screening rather than live autonomous operation.
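As a minimal sketch, the stated training settings map onto Ultralytics training arguments roughly as follows. The dataset YAML file names and the `yolo11n.pt` starting weights are assumptions: the text does not state the model scale, the initialization, or the file layout.

```python
def yolo11_train_kwargs(data_yaml):
    """Hyperparameters stated in the text, keyed by Ultralytics argument names."""
    return dict(
        data=data_yaml,        # hypothetical dataset definition file
        imgsz=640,             # images resized to 640 x 640
        epochs=500,
        optimizer="SGD",
        lr0=0.01,              # initial learning rate
        momentum=0.937,
        weight_decay=0.0005,
    )


def train_all(weights="yolo11n.pt"):
    """Train one model per dataset with an identical configuration.

    The weights file is an assumed placeholder, not the authors' choice.
    """
    # Imported lazily so the configuration can be inspected without Ultralytics.
    from ultralytics import YOLO

    for name in ("D1", "D2", "D3"):
        model = YOLO(weights)  # fresh model for each dataset
        model.train(name=name, **yolo11_train_kwargs(f"{name}.yaml"))
```

Building all three runs from one configuration function mirrors the design choice described above: only the input data changes between runs.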
Model performance was compared using precision, recall, F1-score, mean average precision at an intersection-over-union (IoU) threshold of 0.5 (mAP@0.5), and mean average precision averaged over IoU thresholds from 0.50 to 0.95 in 0.05 increments (mAP@0.5:0.95), together with confusion matrices and loss trajectories. IoU measures the overlap between predicted and reference crown boxes. Precision reflects the fraction of predicted crowns that were correct, recall reflects the fraction of true target crowns detected, and the F1-score summarizes their trade-off. mAP@0.5 reflects average precision under a relatively permissive localization threshold, whereas mAP@0.5:0.95 provides a stricter measure of localization quality across multiple thresholds. Because the practical goal of this work is timely crown detection, special attention was paid to omission patterns, background confusion, and the stability of validation performance across training, rather than to training loss alone. In deployment, the detector can be applied directly to geotagged UAV photographs, reducing the dependence on full orthomosaic production and thus shortening the lag between data capture and management response.
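The metric definitions above can be made concrete with a short sketch. Boxes are assumed to be axis-aligned and given as (x1, y1, x2, y2) in pixels, and the true-positive, false-positive, and false-negative counts are assumed to come from a separate matching step that is not shown here.

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and F1 from matched detection counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# The ten IoU thresholds averaged in mAP@0.5:0.95 (0.50 to 0.95, step 0.05).
MAP_THRESHOLDS = [round(0.50 + 0.05 * k, 2) for k in range(10)]
```

A predicted crown box counts as correct at mAP@0.5 when its IoU with a reference box of the same class is at least 0.5; the stricter metric repeats that check at each value in `MAP_THRESHOLDS`.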
2.4. Paired Crown-Level RGB Analysis
To provide explanatory support for the detector results, we analyzed the RGB measurements as a paired crown-level dataset. It contained 20 crowns measured in both nadir and oblique views, with one matched observation per view for each crown ID. For each crown, mean red (R), green (G), and blue (B) values and their within-crown standard deviations were available. RSD, GSD, and BSD denote the within-crown standard deviations of red, green, and blue pixel values, respectively. We additionally calculated four RGB-derived indices: Excess Green (ExG = 2G − R − B), Excess Red (ExR = 1.4R − G), VARI = (G − R)/(G + R − B), and the R/G ratio. These variables were chosen because they summarize different but complementary aspects of crown appearance, including greenness balance, red dominance, and within-crown heterogeneity.
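The four indices can be computed directly from crown-mean channel values, as in the sketch below. It assumes the indices are evaluated on unnormalized channel means; the text does not state whether any chromatic normalization was applied beforehand.

```python
def rgb_indices(r, g, b):
    """RGB-derived indices from crown-mean R, G, B values, as defined in the text.

    Assumes g != 0 and g + r - b != 0; callers should guard these cases.
    """
    return {
        "ExG": 2 * g - r - b,            # Excess Green
        "ExR": 1.4 * r - g,              # Excess Red
        "VARI": (g - r) / (g + r - b),   # Visible Atmospherically Resistant Index
        "R/G": r / g,                    # red-to-green ratio
    }
```

For example, a crown with mean values R = 100, G = 120, B = 80 gives ExG = 60 and ExR = 20, with VARI and R/G following from the same three means.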
Because the same crowns were observed from both viewpoints, all modality comparisons were treated as paired analyses. For each feature, we calculated the mean paired difference (oblique − nadir), a 95% confidence interval, the Wilcoxon signed-rank p value, and the paired standardized effect size (dz). A multivariate paired profile test (Hotelling’s T²) was used to evaluate whether the 10-feature RGB profile differed jointly between viewing modes. Given the modest sample size (n = 20 matched crowns) relative to the number of features, this multivariate test was treated as an exploratory profile-level statistic rather than as stand-alone evidence for broad generalization. Principal component analysis (PCA) was then applied to the standardized feature matrix to summarize the dominant axes of crown-level variation. This design is more defensible than an unpaired comparison because it isolates the effect of image geometry from tree-to-tree biological variability.
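A minimal sketch of the per-feature paired summary follows. It assumes a Student-t confidence interval, which the text does not specify, and takes the critical value as a parameter; the default of 2.093 is the two-sided 95% value for 19 degrees of freedom (n = 20 pairs).

```python
import math
import statistics


def paired_summary(nadir, oblique, t_crit=2.093):
    """Mean paired difference (oblique - nadir), 95% CI, and effect size dz.

    `t_crit` must match the degrees of freedom (n - 1); 2.093 applies to
    n = 20. The t-based interval is an assumption of this sketch.
    """
    diffs = [o - n for n, o in zip(nadir, oblique)]
    n_pairs = len(diffs)
    mean_d = statistics.mean(diffs)
    sd_d = statistics.stdev(diffs)              # sample SD of the differences
    half_width = t_crit * sd_d / math.sqrt(n_pairs)
    return {
        "mean_diff": mean_d,
        "ci": (mean_d - half_width, mean_d + half_width),
        "dz": mean_d / sd_d,                    # paired standardized effect size
    }
```

The signed-rank p value is not computed here; in practice it could be obtained by passing the same differences to `scipy.stats.wilcoxon`.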
Table 1 presents the main outcomes of this analysis. Because the paired subset comprised only 20 matched crowns, the RGB results were interpreted primarily as explanatory evidence rather than as a stand-alone inferential endpoint.
The RGB analysis was not intended to replace the detector results, explain neural-network attention with certainty, or claim a complete physiological mechanism of wilt severity. Instead, it was used as a secondary observational bridge between image perspective and detection behaviour. If oblique imagery changed the multivariate crown profile in consistent ways, those shifts could provide a plausible phenotypic context for why one model generalized more cleanly than another within this dataset.
3. Results
3.1. Survey Context
Field observations showed a substantial pool of symptomatic trees across all three severity classes, indicating an actively developing decline landscape rather than a dataset dominated by only terminally dead crowns. The early-stage:severely damaged:withered ratio was approximately 1.15:0.72:1.30, implying that both pre-mortality and post-mortality crowns were well represented during the campaign. This distribution is useful for detector evaluation because it requires the models to handle both subtle and visually conspicuous expression of decline.
When decline or mortality was classified by dominant recorded cause, fire accounted for 27.07% of cases, followed by wind breakage (12.59%), biotic agents (9.48%), and drought (8.39%); the remaining observations were attributed to other or uncertain causes. As shown in Figure 3, affected trees were not distributed uniformly but formed localized clusters within the study landscape. These results reinforce that the detector should be interpreted as a symptom-recognition tool rather than a direct etiological classifier. In practice, the workflow is best suited for highlighting crowns that warrant field confirmation, sanitation, or targeted diagnostic follow-up.
3.2. RGB Profile Analysis Based on Paired Crowns
The RGB measurements revealed a modest and exploratory multivariate shift between nadir and oblique observations. When the 10-feature RGB profile (R, G, B, RSD, GSD, BSD, ExG, ExR, VARI, and R/G) was analysed jointly, the paired modality effect was significant (Hotelling’s T² = 58.91, F = 3.10, p = 0.044). This indicates that the two viewing modes captured measurably different crown appearance profiles in this paired subset, even though the same 20 crowns were analysed in both views. The paired summary statistics are presented in Table 1, and the distributions and multivariate separation are shown in Figure 4. Because the matched sample size was limited, this result should be interpreted as exploratory support for the detector comparison rather than as a definitive generalization about all wilt-affected crowns.
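The reported F statistic is consistent with the standard conversion for a one-sample Hotelling's T² applied to paired differences, F = T²(n − p)/[p(n − 1)] with (p, n − p) degrees of freedom. The sketch below assumes this is the conversion that was used, which the text does not state explicitly.

```python
def hotelling_t2_to_f(t2, n, p):
    """Convert a one-sample (paired-difference) Hotelling's T2 to an F statistic.

    n is the number of paired crowns and p the number of features; under the
    null hypothesis F has (p, n - p) degrees of freedom.
    """
    f_stat = t2 * (n - p) / (p * (n - 1))
    return f_stat, (p, n - p)


f_stat, dof = hotelling_t2_to_f(58.91, n=20, p=10)
print(round(f_stat, 2), dof)
```

With n = 20 crowns and p = 10 features the conversion gives F ≈ 3.10 on (10, 10) degrees of freedom, matching the reported value; the small denominator degrees of freedom illustrate why the test is treated as exploratory.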
At the single-feature level, however, most channel means remained comparatively stable between modalities. Mean red, green, and blue intensities showed no statistically significant paired differences (Wilcoxon p = 0.330, 0.261, and 0.985, respectively). Likewise, the R/G ratio and VARI were nearly unchanged on average. The clearest view-dependent shift occurred for Excess Green, which decreased under oblique imaging by 5.77 units on average (95% CI: −9.79 to −1.74; Wilcoxon p = 0.011; dz = −0.67). In practical terms, this suggests that oblique views tended to reduce apparent green dominance without producing a simple, uniform brightness shift across all crowns.
Within-crown heterogeneity showed a more nuanced pattern. Average within-crown standard deviations were slightly lower in oblique images, but the between-crown spread of several traits increased markedly, especially for blue-channel intensity and blue-channel standard deviation. The oblique-to-nadir variance ratio reached 6.70 for mean blue intensity and 1.91 for blue-channel within-crown standard deviation. This means that oblique imagery did not move all crowns in the same direction; rather, it amplified inter-crown contrast for a subset of blue-sensitive or structurally sensitive traits. Such behaviour is consistent with stronger exposure of shaded crown facets, skeletal branch structure, desiccated tissues, and laterally visible crown sectors that are incompletely captured in top-down views, but it may also partly reflect directional illumination, self-shadowing, and sun-sensor geometry associated with the 45–70° oblique acquisition.
PCA further clarified the structure of this shift. As shown in Figure 4D, the first two axes explained 66.7% of the standardized variance (PC1 = 35.7%, PC2 = 30.9%). PC1 was driven mainly by red-dominance and vegetation-balance variables, especially ExR, R/G, and mean red/green behaviour, whereas PC2 was dominated by the three heterogeneity measures. The partial separation of nadir and oblique observations in PCA space therefore reflects a combined shift in color balance and textural heterogeneity rather than a simple gain in luminance. In other words, oblique images may reweight which parts of crown appearance are most visible to the model, rather than merely changing overall brightness.
Taken together, the RGB analysis supports a cautious explanatory interpretation: the advantage of oblique imagery did not appear to arise solely because crowns became brighter or more saturated, but because chromatic, shadow-related, and structural cues were redistributed in a way that plausibly improved crown-level contrast. This interpretation aligns the detector results with directly measured crown-level evidence while avoiding an overextended physiological or causal claim from RGB data alone. The RGB analysis should therefore be read as a parsimonious observational bridge between viewing geometry and downstream detectability, not as proof of how the neural network made every decision.
3.3. Model Performance Comparison
The three YOLO11 models exhibited clear performance differences within the same-campaign validation setting. The oblique-image model (D2) outperformed both the nadir model (D1) and the mixed-view concatenation model (D3) across all major validation metrics (Figure 5A–C). Based on peak validation performance, D2 achieved the highest precision (0.994), recall (0.991), F1-score (0.989), mAP@0.5 (0.995), and mAP@0.5:0.95 (0.880), substantially exceeding the corresponding values of D1 (0.880, 0.799, 0.809, 0.859, and 0.586) and D3 (0.856, 0.821, 0.825, 0.877, and 0.617). The magnitude of the difference in mAP@0.5:0.95 is especially important because it shows that the superiority of D2 extended beyond coarse crown recognition to more precise localization. These high D2 metrics should be interpreted in the context of a controlled, consistently annotated dataset collected under matched site and sensor conditions; they are not an external-transfer benchmark.
The training trajectories in Figure 5B,C reinforced this pattern. D2 rapidly converged toward a substantially higher validation plateau and maintained that advantage through most of the training process. D1 and D3 followed smoother but lower trajectories. Notably, D3, despite containing more total samples, improved only marginally over D1 and remained far below D2. This indicates that under the present non-view-aware training design, simple sample pooling did not yield additive gains. The result should not be interpreted as evidence against multi-view imagery; rather, it suggests that effective multi-view use likely requires explicit view encoding, cross-view correspondence, or dedicated fusion modules.
The confusion matrices in Figure 5D–F explain where these aggregate differences came from. D2 showed near-diagonal dominance, with perfect allocation for the early-stage and withered classes and a severely damaged-class accuracy of 97.4%, with only limited leakage to the withered class. By contrast, D1 and D3 both exhibited substantial target-to-background confusion. In D1, 15.0% of true early-stage crowns, 19.9% of true severely damaged crowns, and 12.7% of true withered crowns were assigned to the background class. D3 reduced some of these errors but still showed considerable omission relative to D2. This means that the performance advantage of D2 was not merely a numerical improvement in a summary statistic; it translated directly into fewer missed symptomatic crowns and cleaner class separation.
From an operational standpoint, this result is important, but its scope should remain bounded. A surveillance system that misses weakly expressed or partially occluded crowns is less valuable than one that identifies them consistently, even if both systems appear stable during training. Here, oblique imagery yielded the best balance between validation accuracy, omission control, and class separability, indicating that image perspective was a major determinant of detection quality under the conditions of this study. At the same time, the result should be interpreted as evidence for a strong operational advantage in this dataset, not as proof that oblique imagery will always outperform all alternative acquisition designs in other forest types, seasons, or illumination regimes.
3.4. Detection Visualization and Case Analysis
Qualitative inspection of the detection outputs was consistent with the quantitative results. As illustrated in Figure 6, the oblique-trained model detected subtle or partially obscured symptomatic crowns more reliably than the nadir-trained model in several representative situations, including steep slopes, forest edges, occluded crowns, and shaded crowns. The differences were especially evident for weakly expressed early-stage targets and for crowns embedded in visually complex backgrounds. These examples are interpretive rather than statistical, but they are important because they show that the aggregate metric advantage of D2 translated into scenario-level behaviour that is directly relevant for field surveillance.
The Grad-CAM visualizations in Figure 6 provide additional interpretive support. Although Grad-CAM does not furnish a causal explanation and cannot prove which physiological traits the network learned, it can indicate regions of high model sensitivity. D1 tended to emphasize coarse, high-contrast cues such as canopy apices, crown boundaries, and shadow edges, whereas D2 showed more crown-centered responses associated with plausible symptomatic features, including local texture discontinuities, laterally visible crown deformation, and internally heterogeneous crown sectors. The scenario-level summary likewise shows that D2 maintained the highest detection success rate across all tested conditions. Together, these results indicate that oblique imagery improved performance not only in average metrics but also in the specific conditions that matter most for practical forest surveillance.
4. Discussion
4.1. Why Oblique Imagery Improved Detector Performance
The central finding of this study is that operational acquisition geometry was a major determinant of detector performance. Because oblique imagery preserves lateral crown surfaces and retains more three-dimensional symptom expression than nadir products, it likely reduced the compression of diagnostically important features into a top-down canopy silhouette. In wilt-affected crowns, these features include partial discoloration, crown-edge collapse, needle droop, asymmetrical thinning, and localized desiccation. Such cues are often underrepresented in nadir orthomosaics, particularly in mountainous stands where self-shadowing, crown overlap, and background clutter are strong [19, 25, 26]. However, view geometry should not be read as pure camera angle isolated from all radiometric factors: 45–70° oblique images necessarily change sun-sensor geometry, self-shadowing, visible background, and lateral structural context. The conclusion is therefore that the oblique acquisition configuration outperformed the nadir configuration under matched campaign conditions, not that camera angle alone was mathematically disentangled from illumination and bidirectional reflectance effects. This should not be interpreted as evidence that oblique imagery is universally optimal for every forest-health application; rather, it indicates that when the target symptom is partly expressed on lateral crown surfaces, acquisition geometry can become as consequential as model choice.
This interpretation is consistent with recent UAV-based pine wilt studies showing that detection performance depends strongly on the visual separability of diseased crowns from their surrounding canopy matrix [17, 18, 19, 20, 21, 22]. However, most prior studies emphasized model design, sensor selection, or multispectral enhancement. By holding site, campaign period, severity classes, and detector family constant while changing the acquisition configuration, the present study examines viewing geometry as an operationally important variable in its own right, subject to the radiometric caveats noted above. That distinction matters because it reframes acquisition design as an equal partner to algorithm design. In practical terms, better detector performance may sometimes be achieved more efficiently by improving what the sensor sees than by continuing to increase model complexity alone.
4.2. What the RGB Analysis Adds
The RGB analysis strengthens this study only when interpreted at the appropriate scale. It anchors interpretation in directly measured crown-level observations, but the paired subset was small and should be treated as exploratory. Importantly, the paired analysis does not support the simplistic idea that oblique imagery merely increases overall brightness or uniformly inflates within-crown variability. Instead, the effect was selective. Green-dominance metrics weakened, blue-related traits became more dispersed across crowns, and the joint RGB profile changed at the multivariate level even when most individual channel means remained stable. Because the matched sample was limited, these patterns should be interpreted as preliminary support for the detector comparison rather than as a complete characterization of crown physiology.
This is a useful explanatory refinement. It suggests that the benefit of oblique imagery may lie less in a global radiometric shift and more in the reweighting of crown appearance cues that help the detector distinguish symptomatic crowns from the surrounding canopy. The stronger spread of blue-related responses may reflect exposure of shaded crown facets, structural skeleton, or desiccated tissues that are incompletely visible from above, while the reduction in ExG is consistent with diminished apparent canopy greenness once lateral and structurally degraded crown sectors are brought into view. Nevertheless, these same patterns may also be influenced by directional shading and canopy reflectance anisotropy. This interpretation remains observational rather than physiological, and it is deliberately scaled to the available data.
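As an illustration of the green-dominance metric discussed above, the following is a minimal sketch that assumes the common excess-green definition ExG = 2g − r − b on chromatic coordinates (r = R/(R+G+B), and likewise for g and b). The exact formulation used in this study may differ, and the function name `excess_green` and all pixel values are hypothetical.

```python
import numpy as np

def excess_green(rgb):
    """Mean ExG = 2g - r - b over crown pixels, using chromatic coordinates."""
    rgb = np.asarray(rgb, dtype=float).reshape(-1, 3)
    total = rgb.sum(axis=1)
    valid = total > 0  # skip pure-black pixels to avoid division by zero
    chroma = rgb[valid] / total[valid, None]
    r, g, b = chroma[:, 0], chroma[:, 1], chroma[:, 2]
    return float(np.mean(2 * g - r - b))

# Hypothetical crown pixels (not data from this study)
green_crown = [[60, 120, 50], [55, 110, 45]]
withered_crown = [[130, 110, 100], [140, 120, 115]]

exg_green = excess_green(green_crown)
exg_withered = excess_green(withered_crown)
```

Because the index is computed on normalized coordinates, it responds to the relative share of green rather than to overall brightness, which is why a drop in ExG is compatible with stable channel means.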
The broader implication is that paired low-cost RGB analysis can provide useful, but preliminary, explanatory support for detector behaviour when more advanced spectroscopy is unavailable. Although RGB data cannot recover the biochemical sensitivity of hyperspectral sensors, it can still reveal whether a different acquisition geometry changes the multivariate crown signature presented to the model. For operational papers, that level of evidence can help move from empirical comparison toward plausible explanation, provided that the limits imposed by sample size, illumination, and radiometric control are stated explicitly.
4.3. Why the Mixed-View Dataset Did Not Outperform the Oblique-Only Model
A noteworthy result is that D3, which combined nadir and oblique samples, remained inferior to D2. This finding should not be interpreted as evidence that multi-view data are useless or biologically redundant. Instead, it shows that the present single-stream YOLO11 training design was not view-aware. From a representation-learning perspective, mixed-view training can enlarge intra-class variance because the same biological class is expressed through substantially different crown geometries, shadow patterns, scale relations, and background structures. Unless the detector is designed to encode view identity, align crowns across views, or fuse features explicitly, that added heterogeneity may dilute the highly informative features already present in oblique imagery.
Operationally, this means that practitioners should not assume that simply collecting more images or pooling more perspectives will necessarily improve performance. The D3 result identifies a limitation of the simplified concatenation strategy rather than a limitation of multi-angle imagery itself. Future work should therefore test dedicated multi-view fusion strategies, such as branch-wise feature extractors, view-aware embeddings, attention-based cross-view aggregation, photogrammetry-guided crown correspondence, or view-consistency losses, rather than relying on dataset concatenation as a substitute for model design.
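The intra-class variance argument above can be illustrated with synthetic numbers. This sketch draws one hypothetical crown feature for a single symptom class under two views with different means, then checks that the pooled variance equals the mean within-view variance plus a between-view term (the law of total variance for two equally sized groups). All values are simulated and are not taken from the D1 to D3 datasets.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500  # samples per view (equal group sizes are assumed)

# One hypothetical feature for the same class, seen from two views
nadir = rng.normal(loc=0.30, scale=0.05, size=n)
oblique = rng.normal(loc=0.10, scale=0.05, size=n)

pooled = np.concatenate([nadir, oblique])

# Population variances (ddof=0), so the decomposition is exact
within = 0.5 * (nadir.var() + oblique.var())
between = 0.25 * (nadir.mean() - oblique.mean()) ** 2
pooled_var = pooled.var()
```

The between-view term is what a single-stream detector must absorb when views are concatenated without view labels. It shrinks only if the two views present the class similarly.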
4.4. Positioning the Findings Within Recent UAV Forest Health Literature
Recent literature provides an informative context for interpreting these results. Reviews of UAV-based forest health monitoring emphasize that UAV systems are particularly powerful when the target signal is expressed at fine spatial scales, when revisit flexibility is important, and when ground access is difficult [
23,
24]. Reviews of artificial intelligence in forest health likewise highlight that pest and disease detection has become a leading application domain, but they also note persistent challenges in transferability, explainability, and operational integration [
28]. The present study contributes to that discussion by showing that acquisition design itself can be a strong leverage point for improving recognition quality. In that sense, the manuscript sits between sensor-rich early detection studies and low-cost operational RGB workflows: it does not claim that RGB is the most physiologically sensitive modality, but it does show that RGB performance can change materially when view geometry is treated as a designed variable rather than a passive by-product of flight planning.
Within the pine wilt detection literature, multispectral and hyperspectral methods remain essential for early physiological detection and for capturing pre-visual stress signals [
16,
21,
27,
30]. In particular, hyperspectral approaches can detect subtle changes before RGB symptoms become visually obvious, and unattended or edge-oriented UAV systems are increasingly being developed to support higher-frequency monitoring [
27,
30]. Nevertheless, these richer sensors impose higher equipment, calibration, and processing demands. By contrast, RGB sensors remain inexpensive, lightweight, and operationally accessible. The main contribution of this study is therefore not to argue that RGB outperforms hyperspectral data in early physiological diagnosis, but to show that the value of low-cost RGB surveillance can be substantially increased when the imaging geometry is chosen to preserve crown-level structural cues that nadir views tend to suppress.
This distinction is practically important. Many management agencies or field teams can deploy RGB UAVs more readily than hyperspectral systems, especially across rugged, time-sensitive forest landscapes. For those users, the question is often not whether RGB is theoretically the best sensing modality, but how to extract the maximum operational value from RGB imagery that is already available. Our results indicate that, within the tested region and season, oblique acquisition is a practical and immediately actionable way to increase the information content of RGB imagery for crown-level screening.
4.5. Operational Implications and Georeferencing Potential
Beyond detection accuracy, oblique imagery offers a practical route to crown georeferencing and rapid post-flight response. Each oblique photograph is associated with camera position and attitude, and matching a detected crown across overlapping images makes it possible to recover object-space location through standard photogrammetric intersection principles [
13,
25]. This is particularly attractive for rapid-response workflows because it allows image-level detection and localization to begin before full orthomosaic production is completed for every mission. In mountainous terrain, that reduction in preprocessing overhead can shorten the lag between data capture and field deployment.
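The intersection principle mentioned above can be sketched as a least-squares intersection of viewing rays. This is a generic illustration, not the pipeline implemented in this study. It assumes that each detection has already been converted into an object-space ray (camera centre plus direction) from the recorded position, attitude, and camera model; the function name and coordinates are hypothetical.

```python
import numpy as np

def intersect_rays(centers, directions):
    """Point minimizing the summed squared distance to a set of 3D rays."""
    centers = np.asarray(centers, dtype=float)
    directions = np.asarray(directions, dtype=float)
    directions = directions / np.linalg.norm(directions, axis=1, keepdims=True)
    A = np.zeros((3, 3))
    b = np.zeros(3)
    for c, d in zip(centers, directions):
        P = np.eye(3) - np.outer(d, d)  # projector onto the plane normal to the ray
        A += P
        b += P @ c
    # A is invertible as long as the rays are not all parallel
    return np.linalg.solve(A, b)

# Hypothetical local coordinates in metres
crown = np.array([10.0, 20.0, 5.0])
cameras = np.array([[0.0, 0.0, 100.0], [50.0, 0.0, 100.0]])
rays = crown - cameras  # noise-free viewing directions toward the crown

estimate = intersect_rays(cameras, rays)
```

With real detections the rays will not meet exactly, and the residual distance to each ray gives a simple check on whether crowns were matched correctly across images.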
In management terms, the proposed workflow is best viewed as a high-throughput triage system rather than an autonomous live-detection system. It can rapidly highlight crowns that warrant field verification, sanitation removal, or causal diagnosis. Importantly, the withered class is not merely a terminal visual category of limited operational value. Trees that have recently reached this stage can become breeding substrates for secondary pests such as longhorn beetles and bark beetles, may require rapid clearing to reduce further pest pressure, provide essential evidence for post-disaster pest assessment and forest-health assessment, and contribute dry standing fuel that can increase local fire risk. Such a role is fully consistent with integrated forest health surveillance and early-warning systems, where UAV detection, field confirmation, sanitation, risk evaluation, and epidemiological interpretation operate as complementary steps rather than as substitutes for one another.
4.6. Limitations and Future Directions
Several limitations should be acknowledged. First, the modelling and image analysis were conducted within one mountainous region and one acquisition season; there was no independent external hold-out site, and geographic transferability across pine species, canopy densities, illumination conditions, seasonal stages, and flight settings remains to be tested explicitly. Second, the training/validation/test partition was designed to reduce spatial leakage at the image-block/flight-segment level, but residual spatial autocorrelation cannot be ruled out in UAV data with overlapping coverage. Third, the detector recognizes symptom expression rather than causal agent identity, which is important in landscapes where drought, fire, wind, and biotic agents can all generate visually similar crown phenotypes. Fourth, the paired RGB data captured only 20 matched crowns and 10 features; Hotelling’s T², PCA, and effect-size results should therefore be considered exploratory, not definitive physiological validation or proof of neural-network attention mechanisms. Fifth, oblique imagery changes not only camera angle but also visible structure, self-shadowing, sun-sensor geometry, background exposure, and bidirectional reflectance effects. Without radiometric calibration, explicit BRDF modelling, or controlled illumination experiments, shading-related contamination of RGB signatures cannot be fully separated from true symptom expression. Sixth, D3 used simple mixed-view concatenation and did not include view labels, view-aware embeddings, cross-view correspondence, or branch-wise fusion. Addressing these limitations will also require radiometrically calibrated imaging and uncertainty-aware annotation protocols.
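The sample-size concern for the paired RGB test can be made concrete. The sketch below is a generic paired Hotelling's T² on synthetic data, not the analysis script used in this study. It shows that with 20 crowns and 10 features the equivalent F statistic has only 10 denominator degrees of freedom. The function name and the simulated values are hypothetical.

```python
import numpy as np

def paired_hotelling_t2(x, y):
    """Paired Hotelling's T^2 for two (n_crowns, n_features) arrays.

    Returns T^2, the equivalent F statistic, and its degrees of freedom.
    """
    d = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    n, p = d.shape
    if n <= p:
        raise ValueError("need more paired crowns than features")
    mean_diff = d.mean(axis=0)
    cov = np.cov(d, rowvar=False)  # sample covariance of the paired differences
    t2 = float(n * mean_diff @ np.linalg.solve(cov, mean_diff))
    # T^2 follows (n - 1) p / (n - p) times F(p, n - p) under the null
    f_stat = (n - p) / (p * (n - 1)) * t2
    return t2, f_stat, (p, n - p)

# Synthetic stand-in for 20 matched crowns and 10 RGB features
rng = np.random.default_rng(1)
nadir = rng.normal(size=(20, 10))
oblique = nadir + 0.3 + rng.normal(scale=0.5, size=(20, 10))

t2, f_stat, dof = paired_hotelling_t2(oblique, nadir)
```

The 10 × 10 covariance matrix here is estimated from only 20 difference vectors, so its inverse is noisy. This is one reason to treat the multivariate result as exploratory.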
Future work should extend the workflow across seasons, host conditions, and forest types; test external hold-out sites; integrate multispectral, hyperspectral, thermal, or 3D crown attributes; evaluate explicit multi-view fusion networks; and benchmark inference latency on defined field hardware. A particularly promising direction is the combination of oblique RGB detection with lightweight or explainable models that are suitable for onboard, low-latency deployment after speed and power constraints are quantified [
21,
27,
28]. Another is to formalize crown matching across overlapping oblique images so that photogrammetric localization becomes an integral output of the detector rather than a downstream add-on. A further priority is controlled angle optimization, in which 30°, 45°, 60°, and mixed-angle strategies are compared under the same labeling and training protocol. These steps would help determine whether the strong advantage of oblique imagery observed here remains stable across broader operational contexts and whether it can be leveraged for both faster surveillance and more precise intervention.
5. Conclusions
This study demonstrates that oblique UAV RGB imagery provides a substantially stronger basis for rapid detection of wilt-affected pine crowns than conventional nadir imagery within the tested mountainous region and acquisition campaign. Using the SDR-derived early-stage, severely damaged, and withered crown classes, and under matched acquisition conditions, the YOLO11 model trained on oblique images achieved the highest precision, recall, F1-score, mAP@0.5, and mAP@0.5:0.95, and it also showed the cleanest confusion structure and the fewest missed symptomatic crowns. The advantage of oblique imagery therefore reflects not only better aggregate metrics, but also more reliable detection under the visually difficult conditions that matter most in operational forest surveillance. These findings should be viewed as strong within-campaign evidence rather than as an external validation across all seasons, forest types, or illumination conditions.
The exploratory RGB analysis provides a plausible explanation for why this advantage emerged. Oblique imagery altered the joint RGB feature profile of crowns, reduced apparent green dominance, and expanded the dynamic range of blue-sensitive crown responses, thereby plausibly improving crown-level discriminability without relying on large shifts in average brightness alone. However, these patterns may also reflect directional shading, canopy reflectance anisotropy, and the small paired RGB sample, so they should not be overinterpreted as a definitive physiological mechanism. Taken together, these results support the prioritization of oblique UAV acquisition for rapid forest health surveillance in rugged pine landscapes and show that acquisition geometry should be treated as a core design variable, not merely a by-product of flight planning, when developing practical detector pipelines for wilt-affected forests. The explicit withered class further extends the workflow from early symptom screening to sanitation prioritization, post-disaster and forest-health assessment, and fire-hazard reconnaissance. Even so, broader cross-site validation, latency benchmarking, explicit multi-view fusion, and radiometric calibration remain essential next steps before operational transfer beyond the tested campaign.