1. Introduction
Tree species classification is an important part of sustainable forest management in Canada, supporting biodiversity conservation, carbon accounting, wildfire risk assessment, habitat protection, and the development of climate-resilient strategies [
1,
2,
3]. Accurate species-level information provides a solid basis for both conservation planning and industry applications. Identifying species at the individual tree level enhances forest inventories by producing detailed, high-resolution maps of forest composition, thereby supporting ecosystem monitoring, targeted management practices, and informed decision-making for conservation and resource use. Although significant research has been done in species classification using remote sensing methods, achieving a fully automated and reliable computational solution is still an ongoing task [
4,
5,
6,
7]. A recent review of forest inventories in Canada identified automated tree species recognition and mapping as a research priority for many provinces [
8]. Despite advances in remote sensing technology, most operational forest inventories continue to characterize forests at broad levels, such as percent hardwood and softwood or broad species mix categories in a given stand, often with limited accuracy or high uncertainty [
9]. The literature highlights the need for improved individual tree detection, since species-specific data are crucial for timber and non-timber valuation, silviculture planning, pest and disease management, biodiversity assessment, and understanding forest succession.
Recent progress in artificial intelligence (AI), gains in computational efficiency, and the availability of high-resolution imagery from unmanned aerial vehicles (UAVs, or drones) have opened new possibilities for detailed forest monitoring, particularly at the individual tree level [
10,
11,
12]. Although ultra-high-density LiDAR has been explored for structurally separable species, both RGB (natural colour) and multispectral (MS) UAV images have been widely used for tree species classification at the individual tree level because they are easier to acquire and offer better spectral separability [
13,
14,
15,
16]. Compared to RGB sensors, off-the-shelf UAV multispectral sensors offer additional advantages for species classification by sensing beyond the visible spectrum. For instance, sensors such as the MicaSense RedEdge and Altum, Parrot’s Sequoia, and the Sentera 6X Thermal Pro also capture red-edge and near-infrared (NIR) bands. These additional bands can enhance tree species discrimination by capturing subtle differences in leaf reflectance that are not apparent in the visible bands.
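As an illustration of how red-edge and NIR bands add information beyond RGB, the sketch below computes two widely used normalized-difference indices, NDVI and NDRE, from crown-mean reflectances. It is not taken from the workflow of this study; the reflectance values and the dictionary keys are hypothetical and serve only to show the arithmetic.

```python
def normalized_difference(band_a: float, band_b: float) -> float:
    """Return (band_a - band_b) / (band_a + band_b), or 0.0 when the sum is zero."""
    total = band_a + band_b
    if total == 0:
        return 0.0
    return (band_a - band_b) / total


def ndvi(nir: float, red: float) -> float:
    """Normalized Difference Vegetation Index: (NIR - Red) / (NIR + Red)."""
    return normalized_difference(nir, red)


def ndre(nir: float, red_edge: float) -> float:
    """Normalized Difference Red Edge index: (NIR - RedEdge) / (NIR + RedEdge)."""
    return normalized_difference(nir, red_edge)


if __name__ == "__main__":
    # Hypothetical crown-mean reflectances on a 0-1 scale (illustration only).
    crown = {"red": 0.05, "red_edge": 0.20, "nir": 0.45}
    print("NDVI:", round(ndvi(crown["nir"], crown["red"]), 3))
    print("NDRE:", round(ndre(crown["nir"], crown["red_edge"]), 3))
```

Neither index can be derived from an RGB-only sensor, which is the practical sense in which the extra bands extend the feature space available for species discrimination.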
Despite these advances, existing tree species classification approaches remain constrained by several methodological and operational challenges. In mixed and structurally complex forests, species discrimination is difficult because different species may exhibit similar canopy appearances and overlapping spectral signatures. Recent work has shown that overlapping spectral responses can limit the separability of forest composition and diversity patterns, indicating that spectral information alone may not always clearly distinguish complex vegetation conditions [
17]. In addition, spectral reflectance can vary within the same species due to differences in canopy structure, biochemistry, physiology, phenology, tree health, illumination, and acquisition conditions [
17]. At the individual-tree level, these challenges are further amplified by crown overlap, forked crowns, variability in tree architecture, and tree density, which can affect both crown delineation and species classification accuracy [
10]. Similar crown shapes and spectral reflectance among species, particularly in mixed conifer–broadleaf forests, can further increase inter-species confusion [
10]. As a result, intra-species variability may be comparable to, or even greater than, inter-species differences, particularly for visually or spectrally similar species in complex mixed forests. Some previous studies have also highlighted that overlapping crowns, irregular canopy structures, spectral similarity between species, and within-species variability complicate both tree crown delineation and classification workflows [
16,
18]. In addition, classification performance is strongly influenced by spatial resolution and scale of analysis, as individual pixels in very high-resolution imagery may represent different canopy components, such as leaves, branches, bark, shadows, or canopy gaps, making the spectral signature of a single species difficult to define consistently [
16,
19]. Conventional machine learning approaches can perform well when suitable predictor variables are available; however, their success depends strongly on manually engineered spectral, textural, structural, or thermal features, and the most relevant features can vary with sensor type, forest structure, species separability, and study site conditions [
16,
19,
20]. Multi-sensor fusion can improve classification by combining complementary information from different sensors, but it also increases data acquisition, preprocessing, co-registration, and feature extraction requirements, which may limit operational scalability [
19,
20].
Several studies have reported high classification accuracies under specific data and sensor configurations. For example, UAV-based multispectral point-cloud classification using a dual attention graph convolutional network achieved an overall accuracy of 89.80% and a macro-F1 score of 87.80% [
21]. Similarly, deep learning models trained on UAV-based RGB crown images achieved strong performance in a temperate forest, with summer imagery producing an average F1-score of 0.96 and fall imagery producing F1-scores greater than 0.90 for several advanced models [
11].
LiDAR-based deep learning studies have further shown the value of three-dimensional structural information for tree classification. In one such study, PointCNN and 3DmFV-Net were evaluated for classifying broader tree categories, including coniferous trees, deciduous trees, and dead-tree classes, using airborne LiDAR data in combination with intensity and multispectral features. The results showed that PointCNN achieved a test accuracy of up to 87.0% when 3D point coordinates, laser intensity, and multispectral information were included, and that the addition of multispectral features improved classification accuracy by up to 16.3 percentage points [
22]. This highlights the importance of combining structural and spectral information, particularly when separating classes that cannot be reliably distinguished using geometry alone. However, the classification was conducted at the broader functional or structural class level rather than at the individual species level, indicating that species-level discrimination remains a more challenging task, especially in mixed forests with spectrally or structurally similar species. Similarly, a UAV-LiDAR-based comparison of machine learning and deep learning methods for four tree species found that PointMLP achieved the highest overall accuracy of 96.94%, followed by random forest and support vector machine models [
23]. However, this study also noted that species with similar crown structures were more prone to misclassification and that broader validation over larger areas is needed before such models can be operationally generalized.
Recent multisensor studies have also emphasized the benefits of integrating spectral, structural, and object-based information in complex forest environments. An object-based deep learning framework using UAV hyperspectral imagery and LiDAR data was developed for tree species classification in natural secondary forests [
24]. The workflow combined U-Net and SLIC for individual tree crown delineation and compared 1D-, 2D-, and 3D-CNN models with and without a convolutional block attention module. The addition of the attention mechanism improved the performance of all CNN models, and the 1D-CNN with attention achieved the highest overall accuracy when selected hyperspectral and LiDAR features were used. The study also showed that red-edge and near-infrared spectral features, texture measures, vegetation indices, and LiDAR height features contributed importantly to species discrimination. At the same time, it highlighted persistent challenges associated with overlapping crowns, crown-boundary delineation, input patch size, and the labour-intensive nature of producing labelled individual-tree samples.
Cross-platform LiDAR-based transfer learning has also shown strong potential for improving generalization across heterogeneous sensors and data-limited target domains. For example, a recent framework using geometry-consistent preprocessing, surface orientation, multi-scale density features, and staged fine-tuning achieved an overall accuracy of 94.8% and a mean F1-score of 91.9% when adapting from a multi-species pretraining dataset to an unseen UAV LiDAR target dataset [
25]. The same study further showed that transfer learning improved training efficiency compared with training from scratch, with substantially faster convergence under limited-data conditions, demonstrating its potential for scalable cross-platform point-cloud analysis in forest monitoring.
Collectively, these studies demonstrate that high tree species classification accuracies can be achieved using rich structural information, hyperspectral or multispectral features, data fusion, advanced deep learning architectures, attention mechanisms, or large and diverse training datasets. However, these high accuracies are generally achieved within specific domains, such as RGB imagery, hyperspectral imagery, multispectral point clouds, or LiDAR point clouds, and often depend on sensor-specific preprocessing, high-density point clouds, or favourable sample conditions. Most existing transfer learning studies in remote sensing focus on land-use classification, seasonal transfer, or single-modality workflows, while cross-sensor transfer for individual tree species classification remains relatively limited. Therefore, despite strong reported performance in recent studies, improving generalization across sensors, spatial resolutions, and labelled-data availability remains an important research gap.
Beyond traditional remote sensing and machine learning algorithms, recent years have seen the wide adoption of Convolutional Neural Networks (CNNs) for tree species classification due to their ability to learn hierarchical spatial and spectral representations directly from image data [
2,
11,
26]. Conventional machine learning classifiers, such as Random Forest (RF) and Support Vector Machine (SVM), remain valuable and widely used in tree species classification [
6,
19], particularly when labelled samples are limited and well-designed predictor variables are available. However, their performance has been shown to vary across species classes, sensor types, study regions, and acquisition conditions [
6,
19]. In particular, these methods tend to perform well for dominant or spectrally distinct species but exhibit reduced accuracy for less represented or spectrally similar classes [
15,
16]. A key limitation is that these methods generally rely on manually engineered spectral, textural, structural, or vegetation-index features, and the quality of the classification is therefore strongly influenced by the choice and transferability of these features [
13,
27]. Features optimized for one dataset, sensor configuration, or forest condition may not generalize effectively to another, increasing data preparation requirements and potentially limiting model portability [
6,
19].
In contrast, CNNs perform feature learning directly from real-world imagery through convolutional filters, allowing the model to learn multi-scale patterns such as edges, textures, crown shapes, branching structure, and higher-level canopy characteristics that may be difficult to define manually [
13,
27]. This is particularly relevant for UAV-based individual tree species classification, where very high spatial resolution imagery contains fine-scale canopy patterns that can support discrimination among visually similar species [
1,
11,
14,
28]. Reviews of CNN applications in vegetation remote sensing have shown that CNNs frequently outperform shallow machine learning methods, largely because they can exploit spatial context and reduce the need for handcrafted feature engineering [
27]. Previous studies have also demonstrated that CNN-based approaches can achieve strong classification performance using UAV imagery and can remain robust under varying acquisition conditions when trained with diverse image samples [
11,
13].
Nevertheless, CNNs are not inherently superior in all situations. They typically require larger training datasets, greater computational resources, and more careful tuning than RF or SVM, and overly complex architectures may overfit or underperform when sample sizes are small and data diversity is limited [
13,
27]. Therefore, under limited labelled data conditions, CNNs are most appropriate when combined with strategies such as transfer learning, data augmentation, regularization, and careful architecture selection to improve generalization [
11,
13,
27]. In this study, CNNs were preferred to exploit learned spatial–spectral representations from tree-crown imagery and evaluate whether knowledge from a larger RGB source dataset could improve multispectral classification under data-scarce conditions. Conventional classifiers such as RF and SVM remain important benchmarks, but the proposed CNN-based framework was selected to support end-to-end feature learning and cross-domain transfer from high-resolution UAV RGB imagery to limited multispectral data.
Despite the advantages of CNN-based approaches, their operational scalability in UAV-based tree species classification remains limited. UAV data collection is constrained by coverage area, cost, and flight regulations, while training CNNs demands large labelled datasets that are costly, time-consuming, and logistically difficult to obtain. Although considerable progress has been made in tree species classification using UAV-acquired RGB, multispectral, and LiDAR-based datasets, most studies focus on optimizing classification within a specific sensor type, season, platform, or data domain. In practice, UAV platforms are equipped with diverse sensors that differ in spatial resolution, spectral band configuration, radiometric response, and noise characteristics. Similarly, tree species classification studies using RGB and multispectral imagery have shown that these data sources contribute different types of information, with RGB imagery providing fine spatial and textural detail and multispectral imagery contributing additional spectral information [
18]. These variations introduce domain shifts between datasets, causing models trained on imagery from one sensor to learn sensor-specific features that may not generalize well to data acquired from another system, thereby limiting broader applicability and operational scalability. More broadly, domain adaptation studies in remote sensing have shown that distribution differences between source and target domains can arise from variations in sensor characteristics, imaging conditions, spatial resolution, and scene properties [
18]. Recent research has also emphasized that robustness and transferability across seasons, sites, larger areas, and sensors remain critical challenges for deep learning pipelines, in addition to the cost of creating training data (data collection, delineation, and labelling) [
11]. These limitations highlight the need for cross-domain supervised transfer learning, where knowledge gained from one sensor (e.g., RGB imagery) is leveraged to improve performance on another sensor (e.g., multispectral imagery) with minimal additional training. Such strategies offer the potential to reduce dependence on large, labelled datasets, improve efficiency, and enable more robust, sensor-independent tree species classification.
Although transfer learning has been widely applied in remote sensing and vegetation classification, many existing approaches rely on models pretrained on general-purpose datasets such as ImageNet. Such pretrained models can improve performance when labelled data are limited, as demonstrated in UAV-based deciduous versus evergreen tree classification using winter orthomosaic imagery, where ImageNet-based transfer learning improved performance compared with training without transfer learning [
29]. However, the benefit of generic pretraining is not always consistent and may depend on the similarity between the source and target domains, the classification task, and the network architecture. For example, pretraining has been shown to provide limited or inconsistent improvement in tree seedling detection, with its effectiveness varying according to network complexity [
30]. Similarly, full training has been reported to outperform fine-tuning of ImageNet-pretrained backbones in remote sensing classification tasks [
31], while another study suggests that task-specific training may outperform generic pretraining when sufficient training data and computational resources are available [
32]. These findings indicate that features learned from generic image datasets may not always be optimal for vegetation-focused remote sensing applications.
Therefore, the present study adopts a more domain-relevant transfer learning strategy. Instead of relying solely on generic ImageNet-based pretraining, we adapt a DenseNet-121 model previously trained on a large, high-resolution UAV RGB tree-crown dataset acquired from the same study site. Although the source and target datasets were acquired using different sensors, they represent similar tree species, canopy structures, and site-specific ecological conditions. This makes the transferred representations more relevant to the target multispectral classification task and provides a more appropriate framework for supervised cross-domain and cross-modal transfer from UAV RGB imagery to UAV multispectral imagery.
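One practical question in RGB-to-multispectral transfer is how a first convolutional layer trained on three input channels can accept five. The source does not state how the input layer was adapted, so the following is only a minimal sketch of one common option: keep the pretrained R, G, and B kernels and initialize each extra band (for example, red-edge and NIR) with the mean of the RGB kernels. The function names, the band order, and the toy weights are all assumptions made for illustration.

```python
from typing import List

Kernel = List[List[float]]        # one 2D kernel, indexed [row][col]
ConvWeights = List[List[Kernel]]  # indexed [out_channel][in_channel][row][col]


def mean_kernel(kernels: List[Kernel]) -> Kernel:
    """Element-wise mean of equally sized 2D kernels."""
    n = len(kernels)
    rows, cols = len(kernels[0]), len(kernels[0][0])
    return [[sum(k[r][c] for k in kernels) / n for c in range(cols)]
            for r in range(rows)]


def inflate_first_conv(rgb_weights: ConvWeights, n_extra: int) -> ConvWeights:
    """Extend 3-channel first-layer weights to 3 + n_extra input channels.

    Assumption: the extra bands are appended after R, G, B, and each one is
    initialized with the mean of that filter's RGB kernels.
    """
    inflated: ConvWeights = []
    for out_filter in rgb_weights:
        if len(out_filter) != 3:
            raise ValueError("expected exactly 3 input channels (R, G, B)")
        rgb_copy = [[row[:] for row in kernel] for kernel in out_filter]
        extra = [mean_kernel(out_filter) for _ in range(n_extra)]
        inflated.append(rgb_copy + extra)
    return inflated


if __name__ == "__main__":
    # Toy weights: one output filter with 1x2 kernels (hypothetical values).
    toy = [[[[1.0, 2.0]], [[3.0, 4.0]], [[5.0, 6.0]]]]
    five_band = inflate_first_conv(toy, n_extra=2)
    print(len(five_band[0]), "input channels; new-band kernel:", five_band[0][3])
```

In a deep learning framework the same idea would be applied to the weight tensor of the first convolution, with all deeper layers reused unchanged before fine-tuning on the multispectral samples.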
In the context of tactical forest management, aerial imagery is routinely collected by provincial governments across Canada with the same spectral configuration as off-the-shelf UAV multispectral sensors (for example, red, green, blue (RGB), red-edge, and NIR bands), making it compatible in terms of spectral content. Despite its operational utility, traditional aerial multispectral imagery has long struggled to support reliable individual tree species classification because its spatial resolution (typically 0.5–2 m) is too coarse to capture crown-level spectral and textural details, especially in dense boreal stands where crowns overlap and mixed pixels are common [
33,
34]. These resolution constraints, combined with radiometric inconsistencies across airborne campaigns, have hindered species-level mapping [
33], thus restricting its use to stand-scale assessment, generally completed through manual interpretation. The performance of automated tree species classification has been limited as well, primarily due to the coarser spatial resolution, which fails to capture species-specific crown texture [
33], the labour-intensive process of generating labelled training data [
34,
35], and the sensitivity of object-based methods to segmentation errors and mixed pixels [
36]. If models trained on high-resolution UAV multispectral imagery could be effectively adapted to classify species in aerial photographs, this would significantly expand the applicability of deep learning to forest inventories and ecological monitoring at larger scales.
In this study, we developed a tree species classification model for UAV-based multispectral imagery using domain adaptation to address limited labelled samples, with potential for scalability to aerial platforms with similar spectral configurations. This was achieved through the following contributions:
- 1. We develop and evaluate a baseline individual tree species classification model using UAV-acquired multispectral imagery with limited labelled samples, examining the role of increased spectral dimensionality under data-scarce conditions.
- 2. Cross-domain transfer learning from RGB to multispectral imagery: We investigate the potential of supervised cross-domain transfer learning by adapting a convolutional neural network (CNN) model trained on high-resolution UAV RGB imagery with a large, labelled dataset to lower-resolution multispectral imagery with limited labelled samples.
- 3. Assessment of scalability to aerial imagery: We assess the feasibility of scaling the adapted model to aerial platforms by applying it to down-sampled UAV multispectral imagery that simulates the spatial characteristics of multispectral aerial photographs used in regional forest inventories.
We demonstrate these contributions through an application in part of a complex mixedwood boreal forest in Canada.
5. Discussion
This study demonstrates that initializing DenseNet-121 from a model pretrained on a large RGB image dataset provides a strong foundation for learning discriminative features in comparatively low spatial resolution multispectral imagery, even with limited training data, for discriminating nine commercial tree species (five coniferous and four deciduous) in a complex mixedwood boreal forest. This is particularly important because the baseline multispectral-only models in Section 3.1 showed only moderate performance, highlighting the challenges of training deep CNNs directly on data-scarce multispectral datasets.
The performance limitations observed in the multispectral-only models are likely influenced by two main factors: the coarser spatial resolution of multispectral imagery compared to RGB data and the limited size of labelled training samples. These constraints reduce the model’s ability to learn fine-grained and generalizable representations, especially in a complex mixedwood forest where species discrimination can be affected by subtle spectral differences, overlapping crowns, and class imbalance.
To address these limitations, this study explored cross-domain transfer learning by adapting a DenseNet-121 model pretrained on a large, high-resolution UAV RGB dataset. Despite differences in sensor modality and spatial resolution, the pretrained model provided a strong initialization for learning discriminative features in multispectral imagery with limited labelled data. The adapted model achieved stable convergence and improved classification performance, demonstrating the effectiveness of transfer learning for multispectral tree species classification under limited training data conditions.
The benefits of transfer learning were further evaluated on downsampled UAV multispectral imagery designed to simulate conventional airborne multispectral data with lower spatial resolution. The model maintained robust performance under these conditions, demonstrating its resilience across different spatial resolutions and its practical applicability for regional-scale forest inventories. Although the downsampled imagery does not replace validation using true airborne multispectral photographs, it provides an initial assessment of the model’s sensitivity to reduced spatial detail and its potential scalability to coarser-resolution aerial imagery.
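The idea of simulating coarser airborne imagery from UAV data can be sketched with simple block averaging, in which each output pixel is the mean of a square block of input pixels. The resampling method and scale factor used in the study are not stated in this section, so both the method and the factor below are assumptions chosen purely for illustration.

```python
from typing import List

Band = List[List[float]]  # one spectral band, indexed [row][col]


def block_average(band: Band, factor: int) -> Band:
    """Downsample a band by averaging non-overlapping factor x factor blocks.

    Trailing rows or columns that do not fill a complete block are dropped.
    """
    if factor < 1:
        raise ValueError("factor must be a positive integer")
    out_rows = len(band) // factor
    out_cols = len(band[0]) // factor
    result: Band = []
    for r in range(out_rows):
        row = []
        for c in range(out_cols):
            block = [band[r * factor + i][c * factor + j]
                     for i in range(factor) for j in range(factor)]
            row.append(sum(block) / len(block))
        result.append(row)
    return result


if __name__ == "__main__":
    # Hypothetical 4x4 band; a factor of 2 halves the resolution in each axis.
    band = [[float(4 * r + c) for c in range(4)] for r in range(4)]
    print(block_average(band, factor=2))
```

Block averaging approximates only the loss of spatial detail. It does not reproduce the viewing geometry, atmospheric path, or radiometric response of a true airborne sensor, which is why validation with real aerial photographs remains necessary.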
Overall, these results highlight cross-domain transfer learning as an effective strategy for overcoming data scarcity and sensor limitations, offering a pathway toward scalable, sensor-independent, efficient, and operationally applicable tree species classification.
5.1. Performance of Supervised Domain Adaptation with RGB Pretrained Model
The classification performance of the RGB-pretrained model adapted to the MS imagery is generally higher and more consistent for coniferous species than for deciduous species. Most conifers achieve relatively high recall (typically approximately 0.78 or higher), with the notable exception of white spruce, which shows comparatively lower performance. In particular, red pine and eastern white pine demonstrate strong and stable classification. This likely reflects the pine-dominated nature of the study area, where greater representation of pine species enhances feature learning and generalization. In contrast, deciduous species exhibit more variable performance; red oak is well distinguished, while red maple shows substantial confusion across classes.
The misclassification patterns reveal a nuanced balance between intra- and inter-group confusion. Conifer errors are more structured and largely confined within the conifer group, as seen in the confusion of white spruce with balsam fir and red pine, indicating similarity in spectral responses. In contrast, deciduous species exhibit broader, less structured mixing across multiple classes. Notably, black ash is misclassified across a wide range of classes, suggesting that the confusion is broadly distributed rather than dominated by any single class. This pattern may be partly attributed to the impact of emerald ash borer infestation in the study area, which alters canopy condition and spectral response, reducing class separability. Although some inter-group confusion is present (e.g., eastern white pine misclassified as red oak), it remains comparatively limited.
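The distinction between intra-group and inter-group confusion can be quantified directly from a confusion matrix. The sketch below splits each group's errors into those that stay within the group (for example, conifer predicted as another conifer) and those that cross groups. The matrix values are invented for illustration and are not the results of this study; the function name and group mapping are likewise hypothetical.

```python
from typing import Dict, List


def group_confusion(matrix: List[List[int]],
                    labels: List[str],
                    group_of: Dict[str, str]) -> Dict[str, Dict[str, int]]:
    """Summarize a confusion matrix (rows = true, columns = predicted) by group.

    For each group of true labels, count correct predictions, errors that
    stay inside the group ("intra"), and errors that leave it ("inter").
    """
    summary: Dict[str, Dict[str, int]] = {}
    for i, true_label in enumerate(labels):
        group = group_of[true_label]
        counts = summary.setdefault(group, {"correct": 0, "intra": 0, "inter": 0})
        for j, pred_label in enumerate(labels):
            n = matrix[i][j]
            if i == j:
                counts["correct"] += n
            elif group_of[pred_label] == group:
                counts["intra"] += n
            else:
                counts["inter"] += n
    return summary


if __name__ == "__main__":
    # Hypothetical counts for four species (illustration only).
    labels = ["white spruce", "balsam fir", "red oak", "red maple"]
    group_of = {"white spruce": "conifer", "balsam fir": "conifer",
                "red oak": "deciduous", "red maple": "deciduous"}
    matrix = [[6, 3, 1, 0],
              [2, 8, 0, 0],
              [0, 1, 8, 1],
              [1, 2, 3, 4]]
    for group, counts in group_confusion(matrix, labels, group_of).items():
        print(group, counts)
```

A high ratio of intra-group to inter-group errors corresponds to the "structured" confusion described for conifers, while a lower ratio corresponds to the broader mixing described for deciduous species.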
Class-wise performance also appears to be influenced by training data representation. Species with relatively greater representation in the training data, such as red pine and eastern white pine, show more stable and higher recall, whereas underrepresented classes, particularly red maple, exhibit poorer performance despite the use of focal loss. This suggests that while loss reweighting mitigates imbalance to some extent, limited training samples still constrain the model’s ability to learn robust class-specific features. Overall, the model demonstrates strong discrimination for dominant conifers while highlighting persistent challenges in separating spectrally similar or underrepresented deciduous species.
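For reference, the standard focal loss for a single sample is FL(p_t) = -α (1 - p_t)^γ log(p_t), where p_t is the predicted probability of the true class. The sketch below evaluates this expression to show how well-classified samples are down-weighted relative to hard ones. The values γ = 2 and α = 1 are common defaults used here as assumptions; the settings used in the study are not given in this section.

```python
import math


def focal_loss(p_true: float, gamma: float = 2.0, alpha: float = 1.0) -> float:
    """Focal loss for one sample, given the probability assigned to the true class.

    With gamma = 0 and alpha = 1 this reduces to ordinary cross-entropy.
    """
    if not 0.0 < p_true <= 1.0:
        raise ValueError("p_true must lie in (0, 1]")
    return -alpha * (1.0 - p_true) ** gamma * math.log(p_true)


if __name__ == "__main__":
    # Compare an easy sample (p = 0.9) with a hard one (p = 0.1).
    for p in (0.9, 0.1):
        ce = focal_loss(p, gamma=0.0)
        fl = focal_loss(p, gamma=2.0)
        print(f"p_true={p}: cross-entropy={ce:.4f}, focal={fl:.4f}")
```

Because the modulating factor (1 - p_t)^γ acts on the loss of samples that are already present, it cannot compensate for classes with too few examples to learn from, which is consistent with the poorer red maple results noted above.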
5.2. Comparison to Existing UAV RGB and MS-Based Approaches
When compared with existing studies, results from UAV RGB imagery highlight the advantages of very high spatial resolution and explicit structural representation. For example, object-based CNN approaches integrating RGB imagery with 3D-derived information have reported accuracies as high as 93% for seven broad-level tree classes, benefiting from reduced intra-class variability [
12]. Similarly, another deep learning model applied to large-area UAV RGB datasets (~40 km²) achieved overall accuracies of approximately 84% across multiple species in a subtropical forest using architectures designed to capture spatial context and small object features [
5]. However, such approaches typically rely on high-quality RGB data, explicit segmentation, or large training datasets, and often involve fewer or more aggregated class definitions.
In contrast, multispectral-based approaches introduce additional spectral information but often face challenges related to limited labelled data and reduced spatial resolution. For instance, studies using UAV multispectral imagery with Random Forest classifiers have reported F1 scores ranging from 0.69 to 0.83 in structurally simpler forest types and with fewer species, often relying on handcrafted spectral, textural, and structural features [
16]. Similarly, high overall accuracies close to 91% have been reported in wetland environments, though these are typically based on classification tasks involving only a small number of species (e.g., three), representing lower inter-class complexity [
15]. Another multi-source UAV study combining RGB, multispectral, and LiDAR data reported an overall accuracy of 83.98%, with species classification improving by 14–18% when multi-season data were used instead of single-season inputs [
6]. That study also showed the added value of vegetation indices, texture, and elevation features, but relied on substantially richer temporal and sensor information than considered here.
5.3. Comparison to Existing Multi-Source and Multi-Temporal Deep Learning-Based Approaches
More comparable deep learning-based approaches further highlight the effectiveness of the proposed method. Large-scale multi-source studies integrating aerial and satellite data (e.g., Sentinel-1 and Sentinel-2) have reported F1 scores of approximately 72% across a higher number of species (e.g., 15) but rely on substantially larger and more diverse training datasets [
7]. Likewise, UAV-based CNN approaches using EfficientNet architectures and extensive multitemporal datasets (e.g., >17,000 images across eight species) have achieved mean macro F1 scores around 75% [
18]. In contrast, the present study achieves a comparable overall accuracy (75%) using substantially fewer labelled samples and without temporal information, demonstrating the efficiency of cross-domain transfer learning. Overall, these comparisons indicate that the proposed approach performs competitively despite operating under more constrained conditions, including limited training data, the absence of time-series information, and higher species complexity. This underscores its practical value for operational forest inventory applications, where labelled multispectral datasets are often scarce and heterogeneous across sensors.
5.4. Performance of the Simulated Aerial Imagery Model
Classification performance of the MS model applied to the downsampled dataset showed a clear but expected decline relative to the original high-resolution imagery, with overall accuracy decreasing to 69%. This decline is also reflected in the macro-F1 score (0.615) and weighted F1 score (0.655), indicating uneven performance across species and comparatively stronger results for the more represented classes. This reduction reflects the loss of fine-scale spatial and textural information that is critical for species-level discrimination in UAV data. Nevertheless, the model retains a meaningful level of separability across several species, demonstrating a degree of robustness to spatial degradation.
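The gap between macro-F1 and weighted F1 follows from how the two averages are formed: macro-F1 weights every class equally, whereas weighted F1 weights each class by its number of true samples. The sketch below computes both from a confusion matrix. The two-class matrix is a hypothetical example, not data from this study.

```python
from typing import List, Tuple


def per_class_f1(matrix: List[List[int]]) -> List[Tuple[float, int]]:
    """Return (F1, support) per class for a matrix with rows = true, columns = predicted."""
    n = len(matrix)
    results = []
    for k in range(n):
        tp = matrix[k][k]
        support = sum(matrix[k])                         # TP + FN
        predicted = sum(matrix[i][k] for i in range(n))  # TP + FP
        denom = support + predicted                      # 2*TP + FP + FN
        f1 = 2 * tp / denom if denom else 0.0
        results.append((f1, support))
    return results


def macro_f1(matrix: List[List[int]]) -> float:
    """Unweighted mean of per-class F1 scores."""
    scores = per_class_f1(matrix)
    return sum(f1 for f1, _ in scores) / len(scores)


def weighted_f1(matrix: List[List[int]]) -> float:
    """Mean of per-class F1 scores weighted by class support."""
    scores = per_class_f1(matrix)
    total = sum(support for _, support in scores)
    return sum(f1 * support for f1, support in scores) / total


if __name__ == "__main__":
    # Hypothetical imbalanced case: a dominant class and a rare, poorly classified class.
    matrix = [[90, 10],
              [8, 2]]
    print("macro-F1:", round(macro_f1(matrix), 3))
    print("weighted F1:", round(weighted_f1(matrix), 3))
```

When rare classes are classified poorly, macro-F1 falls below weighted F1, which matches the pattern of the reported scores (0.615 versus 0.655).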
Species-wise performance shows increased variability compared to that of the high-resolution MS imagery. Coniferous species continue to perform relatively well, most notably red pine (recall 0.91) and eastern white pine (0.88), reinforcing the earlier observation that dominant pine species are more reliably classified, likely due to both their structural distinctiveness and stronger representation in the dataset. However, some conifers exhibit notable degradation; for instance, white spruce drops substantially (to 0.25) and is frequently confused with balsam fir and eastern white pine, indicating that reduced resolution exacerbates spectral similarity within conifer groups. Deciduous species show a more pronounced decline in performance. Red maple is particularly affected (0.20), with widespread confusion across multiple species, while black ash (0.70), although still moderate, continues to exhibit dispersed misclassification patterns. The misclassification patterns reveal an intensification of both intra- and inter-group confusion. While conifer errors remain partly structured within the group (e.g., white spruce with balsam fir), cross-group confusion increases under downsampling, with several deciduous species misclassified as conifers and vice versa. This suggests that the loss of spatial detail reduces the model’s ability to leverage crown structure, forcing greater reliance on spectral cues that are less distinctive across species.
5.5. Comparison to Existing Aerial Imagery-Based Approaches
When compared with studies conducted on true airborne data, the observed performance differences should be interpreted in light of substantially different data requirements and problem settings rather than as a direct benchmark. Airborne studies have shown that classification performance is strongly influenced by sensor richness, spatial resolution, structural information, and class complexity. For example, airborne hyperspectral imagery combined with LiDAR has achieved high kappa accuracies for general macro-classes, forest types, and individual species, with reported values of 93.2%, 82.1%, and 76.5%, respectively. However, when the spectral data were downgraded from hyperspectral to multispectral imagery, classification accuracy decreased, particularly for single-species classification, although performance remained relatively high for broader forest-type and macro-class mapping [
34]. The same study also showed that high-density LiDAR provided more useful structural information than low-density LiDAR when combined with either hyperspectral or multispectral data. Similarly, a full-waveform LiDAR study has shown that classification performance depends strongly on class complexity, with higher accuracies obtained when the task is simplified from multiple tree species to dominant species or broader coniferous/broadleaved groups [
36].
A study using 30 cm aerial imagery combined with airborne LiDAR reported accuracies of approximately 78%, and approximately 73% on independent validation, for nine species using DenseNet-based models [2]. However, these results were achieved for tree species composition mapping, not individual tree-level classification, and relied on an extensive reference dataset comprising approximately 614,582 samples derived from more than 250 aerial images and 354 interpreted sites. Similarly, another large-area aerial mapping study trained and tested nine CNN models using combinations of three training datasets and three architectures, VGG16, ResNet50v2, and DenseNet121, with multiband aerial photographs and a LiDAR-derived canopy height model. The final super-ensemble was evaluated using 1311 independent forest inventory plots and used inter-model agreement to generate spatial uncertainty maps [3]. These examples demonstrate that high-performing airborne approaches are often supported by extensive labelled datasets, explicit structural information, multiple model architectures, and ensemble prediction strategies, all of which can represent major operational challenges.
In contrast, the present study focuses on individual tree classification using limited labelled UAV multispectral data without access to explicit 3D structural inputs. The aerial imagery component was also based on downsampled UAV multispectral imagery designed to simulate the spatial characteristics of conventional airborne multispectral data, rather than true airborne acquisition geometry. Therefore, the downsampled experiment should be interpreted as an initial assessment of model sensitivity to reduced spatial resolution and potential scalability, rather than as a direct comparison with true airborne imagery studies. Importantly, rather than directly competing with such data-intensive approaches, the results highlight the potential of the proposed framework to reduce dependence on large annotated airborne datasets. Despite limited target-domain labels, reduced spatial detail, and the absence of structural inputs, the model maintained reasonable performance for dominant species, particularly conifers such as red pine and eastern white pine. This suggests that cross-domain transfer learning captures features that remain partially invariant to spatial resolution changes. The approach can therefore be viewed as a practical and cost-effective strategy in which UAV data are used to pretrain models and support future tree species composition mapping workflows, ultimately reducing the need for extensive manual interpretation and large labelled datasets in airborne applications.
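One simple way to simulate coarser airborne pixels from UAV imagery is to average non-overlapping blocks of pixels. The resampling method used in this study is not described in this section, so the block-mean choice, the function name `block_mean_downsample`, and the factor below are assumptions made purely for illustration. The factor would be the ratio of the target ground sampling distance to the source ground sampling distance.

```python
def block_mean_downsample(band, factor):
    """Average non-overlapping factor x factor blocks of a single 2D band.

    Trailing rows or columns that do not fill a complete block are dropped.
    For multispectral data, apply the same call to each band.
    """
    rows = len(band) // factor
    cols = len(band[0]) // factor
    out = []
    for r in range(rows):
        out_row = []
        for c in range(cols):
            block = [
                band[r * factor + dr][c * factor + dc]
                for dr in range(factor)
                for dc in range(factor)
            ]
            out_row.append(sum(block) / len(block))
        out.append(out_row)
    return out


# Hypothetical 4 x 4 reflectance band reduced by a factor of 2.
band = [
    [1, 3, 5, 7],
    [1, 3, 5, 7],
    [2, 2, 6, 6],
    [4, 4, 8, 8],
]
coarse = block_mean_downsample(band, factor=2)
```

Block averaging changes only the pixel size. It does not reproduce the viewing geometry, sensor response, or illumination of a true airborne acquisition, which is consistent with the caution expressed above.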
5.6. Training Stability and Generalization
The consistently smooth and stable training and validation curves indicate that the pretrained DenseNet-121 learned generalizable representations rather than memorizing training samples. The small gap between training and validation accuracy, together with the convergence of loss curves, suggests effective regularization and minimal overfitting, even in the presence of an imbalanced dataset. This is particularly important for ecological and forestry applications, where collecting large, well-balanced MS datasets is often impractical. The results suggest that a small number of labelled samples, when combined with appropriate pretraining and regularization strategies, can be sufficient to train reliable deep learning models. Although the model was trained for a fixed number of epochs to comprehensively assess its convergence behaviour, improvements in validation performance became marginal during the later stages of training. Future implementations could incorporate an early stopping strategy with a defined patience parameter to improve computational efficiency while maintaining model performance.
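The early stopping strategy mentioned above can be sketched in a framework-independent way. The class below is an illustrative helper, not part of this study: the names `EarlyStopping`, `patience`, and `min_delta`, and the validation accuracies, are hypothetical. Training stops once the monitored validation metric has failed to improve by at least `min_delta` for `patience` consecutive epochs.

```python
class EarlyStopping:
    """Stop when a validation metric (higher is better) stops improving."""

    def __init__(self, patience=10, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("-inf")
        self.best_epoch = -1
        self.bad_epochs = 0

    def step(self, epoch, val_metric):
        """Record one epoch; return True when training should stop."""
        if val_metric > self.best + self.min_delta:
            self.best = val_metric
            self.best_epoch = epoch
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Hypothetical validation accuracies per epoch (not results from the study).
val_acc = [0.60, 0.68, 0.72, 0.74, 0.75, 0.75, 0.749, 0.75, 0.748]
stopper = EarlyStopping(patience=3, min_delta=0.001)
stopped_at = None
for epoch, acc in enumerate(val_acc):
    if stopper.step(epoch, acc):
        stopped_at = epoch
        break
```

In practice, the weights from `best_epoch` would be restored after stopping, so that the final model corresponds to the best validation score rather than the last epoch.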
5.7. Impact of RGB-Based Pretraining
A key observation from this work is the substantial performance gain achieved through RGB-based pretraining. When pretrained weights were not used and data augmentation was omitted, classification accuracy on MS imagery dropped sharply, indicating that learning from scratch is inadequate under limited-sample conditions. In contrast, initializing the network with RGB-pretrained weights significantly improved accuracy, confirming the importance of transfer learning. Although RGB and MS data differ spectrally, the pretrained network appears to transfer low- and mid-level spatial features (e.g., texture, edges, and structural patterns) that remain relevant across domains. This supports the use of RGB pretraining as a practical form of domain adaptation for MS classification tasks, particularly when labelled MS data are scarce.
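Reusing RGB-pretrained weights for MS input raises a practical question when the number of MS bands differs from three, because the first convolution expects three input channels. How this was handled in the study is not stated in this section. The sketch below shows one commonly used heuristic, offered only as an assumption: average each filter over its RGB channels, replicate that kernel across the MS bands, and rescale so the summed weight per filter is unchanged. The function name `adapt_first_conv` and the five-band example are hypothetical.

```python
def adapt_first_conv(rgb_weights, n_bands):
    """Expand first-layer weights from 3 input channels to n_bands.

    rgb_weights[o][c][i][j]: output filter o, input channel c (0..2),
    kernel row i, kernel column j. Returns the same layout with n_bands
    input channels. This is a heuristic initialisation, not a trained mapping.
    """
    scale = 3.0 / n_bands  # keeps the per-filter weight sum unchanged
    adapted = []
    for filt in rgb_weights:
        if len(filt) != 3:
            raise ValueError("expected 3 input channels per filter")
        k_h, k_w = len(filt[0]), len(filt[0][0])
        mean_kernel = [
            [(filt[0][i][j] + filt[1][i][j] + filt[2][i][j]) / 3.0 for j in range(k_w)]
            for i in range(k_h)
        ]
        adapted.append(
            [[[v * scale for v in row] for row in mean_kernel] for _ in range(n_bands)]
        )
    return adapted


# One hypothetical filter with a 1 x 1 kernel, expanded to 5 bands.
rgb = [[[[0.3]], [[0.6]], [[0.9]]]]
ms = adapt_first_conv(rgb, n_bands=5)
original_sum = sum(ch[0][0] for ch in rgb[0])
adapted_sum = sum(ch[0][0] for ch in ms[0])
```

Only the first layer needs this treatment; deeper layers operate on feature maps whose shapes do not depend on the number of input bands, which is where the texture and edge features described above would be carried over.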
5.8. Evaluation of Loss Functions
The comparison between focal loss and categorical cross-entropy loss reveals that both loss functions benefited from pretraining and regularization. While focal loss is theoretically well-suited for imbalanced datasets, the observed performance differences between focal loss and categorical cross-entropy were relatively small once dropout, L2 regularization, and data augmentation were applied. This suggests that, in this setting, the benefits of focal loss may be partially absorbed by other regularization mechanisms, especially when strong pretrained features are available. Nonetheless, focal loss contributed to stable optimization and helped mitigate class imbalance during training.
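For reference, the two losses differ only by a modulating factor. For a sample whose true class receives predicted probability p_t, categorical cross-entropy is -log(p_t), while focal loss multiplies this by (1 - p_t)^gamma so that well-classified samples contribute less. The sketch below illustrates this under stated assumptions: the focusing parameter gamma = 2 and weighting alpha = 1 are common illustrative choices, not the settings reported for this study, and the probability vectors are hypothetical.

```python
import math


def cross_entropy(probs, true_idx):
    """Categorical cross-entropy for one sample."""
    return -math.log(probs[true_idx])


def focal_loss(probs, true_idx, gamma=2.0, alpha=1.0):
    """Focal loss for one sample; reduces to cross-entropy when gamma == 0."""
    p_t = probs[true_idx]
    return -alpha * (1.0 - p_t) ** gamma * math.log(p_t)


# Hypothetical softmax outputs; the true class is index 0 in both cases.
easy = [0.9, 0.05, 0.05]  # confidently correct sample
hard = [0.2, 0.5, 0.3]    # misclassified sample

easy_ratio = focal_loss(easy, 0) / cross_entropy(easy, 0)
hard_ratio = focal_loss(hard, 0) / cross_entropy(hard, 0)
```

With these values the easy sample keeps about 1% of its cross-entropy loss and the hard sample about 64%, which shows how focal loss shifts emphasis toward difficult samples, often those from minority classes.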
5.9. Role and Limits of Data Augmentation
Data augmentation played a critical role in improving overall performance, increasing accuracy from approximately 66%–69% (without augmentation) to 75% when augmentation was applied. This confirms that synthetic variability can partially compensate for the limited number of training samples and the challenges posed by high spatial resolution (30 cm) MS imagery. However, further increasing the number of augmented samples (five augmented images per original image) did not yield additional performance gains and instead led to saturation or slight degradation in accuracy (from 75% to 74%). This suggests that excessive augmentation may introduce redundant or less informative samples, offering diminishing returns once the model has learned the dominant spatial and spectral patterns. These findings highlight the importance of balancing augmentation size with dataset size and variability.
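A minimal sketch of generating several augmented copies per crown image is given below. It assumes simple geometric augmentation (random 90° rotations and horizontal flips) applied to a single band stored as a list of rows. The transforms actually used in this study are not listed in this section, so the function names and the choice of transforms are illustrative assumptions.

```python
import random


def rot90_cw(band):
    """Rotate a 2D band 90 degrees clockwise."""
    return [list(row) for row in zip(*band[::-1])]


def hflip(band):
    """Mirror a 2D band left to right."""
    return [row[::-1] for row in band]


def augment(band, n_copies, rng):
    """Return n_copies randomly rotated and flipped versions of a band.

    For multispectral crowns, the same random transform should be applied
    to every band so that the bands stay co-registered.
    """
    variants = []
    for _ in range(n_copies):
        aug = [row[:] for row in band]
        for _ in range(rng.randrange(4)):  # 0 to 3 quarter turns
            aug = rot90_cw(aug)
        if rng.random() < 0.5:
            aug = hflip(aug)
        variants.append(aug)
    return variants


rng = random.Random(0)
crown = [[1, 2], [3, 4]]  # hypothetical 2 x 2 crown chip
variants = augment(crown, n_copies=50, rng=rng)
distinct = {tuple(map(tuple, v)) for v in variants}
```

Rotations by multiples of 90° combined with a flip can produce at most eight distinct versions of an image, so requesting more copies from this particular family only yields duplicates. This is a property of the sketch rather than a claim about the augmentation used in the study, but it illustrates how additional augmented samples can become redundant.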
5.10. Operational Scalability and Cost Considerations
In operational forestry, the dominant cost drivers are field data collection, manual interpretation, and repeated airborne acquisitions. In the Canadian boreal context, even basic field inventory plots typically cost $250–500 per plot (based on operational experience in Canada). Generating species labels for remote sensing workflows requires additional expert time, specifically for an experienced geomatician with forestry and field knowledge to delineate and label individual crowns, which could add approximately another $2 per crown (based on recent operational experience over large areas). At landscape scales, these annotation requirements quickly exceed the cost of the imagery itself. Although airborne multispectral surveys can be effective for district or provincial scale applications, their acquisition costs also remain high, averaging approximately $75 per km², whereas UAV acquisitions can be conducted for less than $2 per ha [46, 47], making them substantially more accessible for local, regional, or project-level mapping.
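The unit costs quoted above can be combined in a back-of-envelope calculation. The sketch below uses only the rates stated in the text ($250–500 per plot, about $2 per crown, about $75 per km² for airborne imagery, and a ceiling of $2 per ha for UAV imagery). The project sizes (40 plots, 10,000 crowns, 100 km², 0.5 km²) and the function names are hypothetical, and fixed costs such as mobilization are not included because they are not quantified here.

```python
HA_PER_KM2 = 100  # 1 km^2 equals 100 ha


def label_cost(n_plots, n_crowns, plot_cost=(250.0, 500.0), crown_cost=2.0):
    """Return (low, high) cost in dollars of field plots plus crown annotation."""
    crowns = n_crowns * crown_cost
    return (n_plots * plot_cost[0] + crowns, n_plots * plot_cost[1] + crowns)


def airborne_imagery_cost(area_km2, rate_per_km2=75.0):
    """Approximate airborne multispectral acquisition cost in dollars."""
    return area_km2 * rate_per_km2


def uav_imagery_cost_ceiling(area_ha, rate_per_ha=2.0):
    """Upper bound on UAV acquisition cost in dollars (text: under $2 per ha)."""
    return area_ha * rate_per_ha


# Hypothetical project sizes, for illustration only.
labels_low, labels_high = label_cost(n_plots=40, n_crowns=10_000)
airborne = airborne_imagery_cost(area_km2=100)
uav_ceiling = uav_imagery_cost_ceiling(area_ha=0.5 * HA_PER_KM2)
```

For these assumed sizes, labelling costs $30,000–40,000 against $7,500 for 100 km² of airborne imagery, which illustrates the statement that annotation can exceed the cost of the imagery itself.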
The cross-sensor transfer learning approach evaluated in this study has the potential to directly address these structural cost constraints. By pretraining on high-resolution UAV RGB imagery and adapting the model to lower-resolution multispectral data with only limited target-domain labels, the method reduces both the number of required field plots and the volume of crown-level annotations. Although the adapted model achieved a moderate overall accuracy (75%) relative to the highest-performing single-sensor deep learning studies [11, 18] or the sensor-specific models [25], this level of performance is comparable to, or better than, many manual interpretation workflows currently used in operational forestry mapping. More importantly, the approach reduces labelling effort and enables the use of heterogeneous imagery sources. If the downsampled MS data could be replaced with conventional airborne photographs, there is strong potential to further reduce the overall cost of producing species-level maps over broader geographic areas. This would also enhance the level of detail from stand-level classification to individual tree-level mapping.
For agencies responsible for large, remote, and heterogeneous forests at operational scales, the trade-off of slightly lower accuracy can be outweighed by gains in cost efficiency, repeatability, and scalability. The results suggest that cross-sensor transfer learning provides a viable pathway toward sensor-independent, lower-cost species mapping, enabling more frequent updates and broader geographic coverage without the prohibitive expenses associated with traditional inventory methods.
5.11. Future Research Directions
These findings reinforce the notion that large, easily acquirable RGB datasets can be leveraged to overcome data scarcity in MS satellite and aerial imagery, supporting the secondary objective of domain generalization across spectral modalities and spatial resolutions. Although performance degrades with reduced resolution, the model maintains reasonable accuracy for dominant species, particularly pines. However, the increased confusion among underrepresented and spectrally similar species highlights the need for further adaptation when transferring to operational airborne imagery.
In this study, the model was evaluated in a zero-shot transfer setting, where weights learned from high-resolution UAV multispectral data were directly applied to simulated aerial-scale inputs. While this provides a useful measure of robustness, performance is likely constrained by the lack of adaptation to resolution-induced feature shifts. As a next step, fine-tuning the pretrained DenseNet-121 on downsampled imagery, even with limited samples, could help recalibrate spatial filters and better align feature representations with coarser canopy structure, thereby reducing inter-class confusion, particularly among similar coniferous and underrepresented deciduous species.
A further extension could be a multi-resolution learning framework, in which the model is trained on both high-resolution UAV data and lower-resolution (simulated or real) airborne data. This would help the model learn features that are more stable across spatial scales, and could be strengthened with domain adaptation methods, such as aligning features between high- and low-resolution data. A practical next step could be a simulated-to-real transfer pipeline, in which downsampled UAV data act as a bridge between UAV and real airborne multispectral imagery used in operational forest inventory systems, allowing more realistic testing of model performance in deployment settings.
Finally, the present study was designed to evaluate CNN-based baseline modelling and supervised RGB-to-multispectral domain adaptation. Future work could extend this framework by incorporating benchmark comparisons with conventional machine learning classifiers, such as Random Forest, which are widely used in tree species classification. When implemented with appropriate feature extraction, model tuning, and validation procedures, such comparisons would provide additional context for assessing the relative advantage of the proposed transfer-learning framework.
5.12. Summary
Overall, the results indicate that the proposed training strategy successfully overcomes two major challenges in MS image classification: limited sample size and differences in spatial resolution and spectral characteristics. By leveraging RGB-pretrained DenseNet-121 models, effective regularization, and moderate data augmentation, the model achieves stable learning dynamics and robust performance without requiring large MS datasets.