Article

MMTSCNet: Multimodal Tree Species Classification Network for Classification of Multi-Source, Single-Tree LiDAR Point Clouds

by Jan Richard Vahrenhold *, Melanie Brandmeier and Markus Sebastian Müller
Faculty of Plastics Engineering and Surveying, Technical University of Applied Sciences Würzburg-Schweinfurt, Röntgenring 8, 97070 Würzburg, Germany
* Author to whom correspondence should be addressed.
Remote Sens. 2025, 17(7), 1304; https://doi.org/10.3390/rs17071304
Submission received: 1 March 2025 / Revised: 30 March 2025 / Accepted: 3 April 2025 / Published: 5 April 2025

Abstract:
Trees play a critical role in climate regulation, biodiversity, and carbon storage as they cover approximately 30% of the global land area. Nowadays, Machine Learning (ML) is key to automating large-scale tree species classification based on active and passive sensing systems, with a recent trend favoring data fusion approaches for higher accuracy. The use of 3D Deep Learning (DL) models has improved tree species classification by capturing structural and geometric data directly from point clouds. We propose a fully Multimodal Tree Species Classification Network (MMTSCNet) that processes Light Detection and Ranging (LiDAR) point clouds, Full-Waveform (FWF) data, derived features, and bidirectional, color-coded depth images in their native data formats without any modality transformation. We conduct several experiments as well as an ablation study to assess the impact of data fusion. Classification performance on the combination of Airborne Laser Scanning (ALS) data with FWF data scored the highest, achieving an Overall Accuracy (OA) of nearly 97%, a Mean Average F1-score (MAF) of nearly 97%, and a Kappa Coefficient of 0.96. Results for the other data subsets show that the ALS data in combination with or even without FWF data produced the best results, closely followed by the UAV-borne Laser Scanning (ULS) data. Additionally, it is evident that the inclusion of FWF data provided significant benefits to the classification performance, resulting in an increase in the MAF of +4.66% for the ALS data, +4.69% for the ULS data under leaf-on conditions, and +2.59% for the ULS data under leaf-off conditions. The proposed model is also compared to a state-of-the-art unimodal 3D-DL model (PointNet++) as well as a feature-based unimodal DL architecture (DSTCN). The MMTSCNet architecture outperformed the other models by several percentage points, depending on the characteristics of the input data.

1. Introduction

Covering approximately 30% of the global land area, forests play a major role in Earth’s ecosystem functions, providing carbon storage, habitats for countless species, as well as biogeophysical and biogeochemical management [1,2]. Protected woodland areas around the world are able to store 61.43 Gt of above-ground carbon, which is equivalent to one year of global fossil fuel emissions (data from 2020) [3]. Additionally, forests greatly impact microclimatic conditions by exerting a cooling effect due to evapotranspiration. Conservation, reforestation, and continuous monitoring of forests are becoming increasingly important for decision makers. Additional factors that require intensive monitoring of forest health and tree species distribution on a local scale include pest infestations [4], the dependence of locally occurring animal species on tree species distribution [5], as well as climate-dependent hazards such as wildfires, droughts, and hydrological hazards [6]. While forest monitoring was mainly conducted manually in the past [7], the emergence of remote sensing technologies has drastically transformed forest monitoring by eliminating the spatial limitations of in situ sampling [8] and has generally led to a much wider availability of data and new approaches in forest monitoring. Automated tree species classification is crucial for cost- and time-efficient collection of forest inventory data and is extremely important as an input for further models [9] that depend on species-specific parameters or results.
Remote sensing platforms such as satellites, Unmanned Aerial Vehicles (UAVs), and other airborne systems facilitate large-scale and high-resolution data acquisition. This widespread availability of geospatial data has catalyzed the development of increasingly sophisticated models, enabling the exploration of complex spatial patterns and species-specific structural traits. For tree species classification, this shift implies a transition from manual or rule-based analyses toward automated, data-driven, Machine Learning (ML), and Deep Learning (DL) approaches, which leverage multimodal information to improve classification accuracy, scalability, and ecological interpretability [9,10,11,12]. While multimodal approaches have gained substantial traction due to their ability to integrate complementary information sources, unimodal methods continue to yield promising results.
Most unimodal approaches operate on features derived from Light Detection and Ranging (LiDAR) point clouds, spectral characteristics from imagery, or fixed sets of rules based on expert knowledge [13,14,15,16]. These approaches often rely on expert-based feature engineering to describe tree structure, height profiles, or intensity-based metrics. While these methods achieve good results, with overall accuracies ranging from 75% to 98% [13,14,15], the number of species classified is often limited due to a lack of generalization across varying forest types, acquisition conditions, and species-specific structural complexity. Additionally, handcrafted features are typically dataset-specific and may not transfer well to heterogeneous forest environments or different sensor configurations. This, in combination with the need for expert knowledge, constrains the scalability and ecological interpretability of unimodal approaches and limits their suitability for operational forest monitoring tasks that require broad species coverage and robustness across various phenological stages.
Recent progress in DL frameworks has facilitated the development of deep unimodal model architectures, such as PointNet [17], its successor PointNet++ [18] for point clouds, and EfficientNetV2 [19], which are capable of operating directly on minimally processed sensor data, including raw point clouds [20,21,22,23] and high-resolution imagery [24,25,26]. Unlike earlier unimodal approaches that rely heavily on expert-based, domain-specific features, these architectures learn hierarchical spatial and contextual representations from nearly unprocessed data, reducing the need for manual feature engineering and expert knowledge. While the overall accuracies achieved by such models are often comparable to or slightly lower than those reported by expert-knowledge-driven methods, their ability to generalize across diverse forest structures, acquisition conditions, and sensor platforms represents a significant advancement. This shift holds particular relevance for large-scale ecological applications, where variability in species morphology, forest density, species distribution, and phenological conditions can undermine the effectiveness of rigid, rule-based systems.
To overcome the inherent limitations of unimodal systems, recent research has increasingly explored multimodal approaches that integrate complementary data sources—typically combining structural information from LiDAR with spectral cues from optical imagery [10,11,12,27]. These methods leverage the fact that different sensor modalities capture distinct yet synergistic aspects of forest structure and composition. LiDAR provides detailed geometric and, in some cases, radiometric data, while spectral imagery contains biochemical and phenological signals. The fusion of such modalities has been shown to significantly improve classification performance, particularly in structurally complex or species-rich environments [11,13,28].
Various fusion strategies have been employed in recent studies: Ferreira et al. [29] aligned airborne LiDAR with high-resolution RGB and Near-Infrared (NIR) imagery to classify urban tree species using a dual-branch ResUNet model, achieving F1-scores of 0.737 across six species. Reisi Gahrouei et al. [27] demonstrated that multimodal patch-based classification with DenseNet and Swin Transformer achieved higher accuracy than unimodal models, particularly when using larger spatial contexts in the form of image patches. Liu et al. [28] proposed a hybrid architecture (TSCMDL) combining PointMLP and ResNet-50 to process raw point clouds and RGB imagery in parallel, with performance gains of up to +4% F1-score over unimodal baselines. Zhang et al. [13] introduced DSTCN, a feature-engineering-based multimodal approach that converts LiDAR point clouds into histogram descriptors, achieving 94% OA across seven species.
Despite these advancements, most existing multimodal frameworks suffer from one or more of the following limitations: (i) Modalities are often transformed into common intermediate representations, risking loss of modality-specific information; (ii) Fusion is commonly performed either too early (input level) or too late (decision level), limiting the model’s capacity to learn cross-modal interactions; (iii) Most architectures lack a dynamic mechanism to adjust modality weighting, which depends on data quality, species morphology, or environmental conditions. As a result, the full potential of multimodal data is lost.
To address some of the previous shortcomings, we propose a fully multimodal deep learning architecture, Multimodal Tree Species Classification Network (MMTSCNet), which processes either airborne or UAV-based LiDAR point clouds, numerical features derived from individual LiDAR tree point clouds and the corresponding Full-Waveform (FWF) tree point clouds, as well as bidirectional, color-coded depth images of individual trees. Each modality is handled by a dedicated feature extraction branch designed to retain data-specific characteristics. A novel Dynamic Modality Scaling (DMS) module is introduced to adaptively learn the relative importance of each modality during training. By preserving native data structures and incorporating modality-aware fusion, our architecture is capable of robust and scalable tree species classification across structurally heterogeneous forest environments, phenological conditions, and acquisition platforms.

2. Materials and Methods

In this section, we describe the datasets used, our study design, and the proposed novel network architecture—MMTSCNet.

2.1. Study Sites and Datasets

We used two published benchmark datasets for developing and testing our model: The datasets [30,31] were published on the PANGAEA data platform [32] in 2022 and are thoroughly described and documented in Weiser et al. [33]. The datasets include twelve Central-European study plots, which feature mixed forests and are located in the federal state of Baden-Württemberg, Germany, near the cities Karlsruhe (49°02′05.63″N, 8°25′34.03″E) and Bretten (49°00′49.64″N, 8°41′41.39″E) as shown in Figure 1. Both study areas, influenced by continental as well as temperate oceanic climate, experience annual precipitation of 810 L/m² [34], and their elevations are approximately 160 m and 236 m above sea level, respectively.
In the following, we will briefly describe the datasets and the processing applied by Weiser et al. [30,31]. For each of the twelve individual mixed forest plots, separate acquisition campaigns were conducted to obtain ALS, ULS, Terrestrial Laser Scanning (TLS), forest inventory, and FWF data [33]. The tree species distribution within the acquired data is composed of 21, 20, and 20 different broad-leaved and coniferous species for the ALS data, ULS data under leaf-on conditions, and the ULS data under leaf-off conditions, respectively, with Fagus sylvatica, Pinus sylvestris, Picea abies, Quercus petraea, Quercus rubra, Carpinus betulus, and Pseudotsuga menziesii being most widely represented. Table 1 summarizes the distribution of coniferous and broad-leaved species present in the original datasets. Given the pronounced species imbalance evident in the dataset, which poses a risk of inducing bias in our proposed model architecture, we excluded all species contributing less than 5% to the total dataset, resulting in the aforementioned seven most widely represented species being used for our experiments.
In order to obtain individual point clouds for each tree, Weiser et al. [30] applied multiple segmentation processes. They first segmented individual tree point clouds from the TLS data using Euclidean clustering and a competitive Dijkstra algorithm. Independent editors were then tasked with supervising and refining the automated segmentation results. The individual tree point clouds for the ULS and ALS data were mainly extracted manually. The previously extracted individual TLS tree point clouds were used to automatically segment individual trees from ULS point clouds using a k-Nearest Neighbors (k-NN) approach implemented as a k-d tree. Weiser et al. [33] noted that it is important that the source point cloud has a similar or slightly higher number of points than the queried point cloud in order to prevent extrapolation. Because of this principle, Weiser et al. [33] only used the segmented individual ULS point clouds for the automated segmentation of ALS point clouds. Table 2 shows the final distribution of relevant data for the individual plots that were used for our experiments.
As a byproduct of the ALS campaign, full-waveform data for the individual plots were also captured and made available on the PANGAEA [32] data platform as a separate dataset [30]. The FWF data are composed of LiDAR point clouds for each flight strip, which are clipped to the extents of the ALS data for each forest plot, and their corresponding Wave-Packet Descriptor (WPD) files. These data were used in combination with the previously described point clouds for our multimodal approach.

2.2. Data Preprocessing

Extensive preprocessing was conducted to create our input data and to improve model performance. The full preprocessing pipeline is depicted in Figure 2. In the first step, the source data were divided into three subsets based on the acquisition sensor platform and the leaf condition (ALS leaf-on, ULS leaf-on, ULS leaf-off). For each georeferenced LiDAR tree point cloud in the three data subsets, the corresponding georeferenced and merged FWF flight strip point cloud was queried using a k-NN search with k = 10. This approach enabled the extraction of individual FWF tree point clouds aligned with the pre-segmented LiDAR trees, thereby allowing for the retrieval of detailed radiometric information at the tree level.
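A minimal sketch of this extraction step, assuming the merged FWF flight strip is available as an (M, 3) NumPy array and using scipy.spatial.cKDTree for the neighbor search (function and variable names are illustrative, not the authors' released implementation):

```python
import numpy as np
from scipy.spatial import cKDTree

def extract_fwf_tree(tree_xyz: np.ndarray, fwf_xyz: np.ndarray, k: int = 10) -> np.ndarray:
    """For every point of a pre-segmented LiDAR tree, query the k nearest points
    in the merged FWF flight-strip cloud and return their union."""
    fwf_index = cKDTree(fwf_xyz)                       # spatial index over the FWF strip
    _, neighbor_idx = fwf_index.query(tree_xyz, k=k)   # shape (n_tree_points, k)
    unique_idx = np.unique(neighbor_idx)               # drop duplicate neighbors
    return fwf_xyz[unique_idx]

# Usage: fwf_tree = extract_fwf_tree(lidar_tree[:, :3], fwf_strip[:, :3], k=10)
```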
All point clouds with fewer total points than the resampling target required by the network (2048) were eliminated from the dataset to prevent excessive upsampling during the preprocessing, which would lead to artificial noise in the point cloud data. This step is based on the findings by Fan et al. [15], who suggest a resampling target of 2048 points in order to retain a balance between information and computational cost. Additionally, point clouds were excluded as outliers if their height deviated from the species-specific mean height ($h_{mean}$) by more than 85% of the species’ standard deviation. The 85% threshold represents a conservative criterion for outlier detection, excluding only those individuals with extreme heights while preserving the majority of the population distribution, which is assumed to approximate a normal or near-normal distribution. The remaining LiDAR and FWF point clouds were partitioned into 90% training and 10% test sets using a stratified shuffle split to preserve the original class distribution in both subsets. This ensures that further preprocessing steps (described in the following) are applied independently, preventing information leakage during the training stage of our proposed model.
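The point-count filter, the species-specific height-outlier criterion, and the stratified 90/10 split could be sketched as follows (a simplified illustration using scikit-learn; data structures and the random seed are our own assumptions):

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

MIN_POINTS = 2048  # resampling target; smaller clouds are discarded

def filter_indices(clouds, heights, labels, std_fraction=0.85):
    """Return indices of trees that pass the point-count and height-outlier checks."""
    heights, labels = np.asarray(heights), np.asarray(labels)
    keep = []
    for i, cloud in enumerate(clouds):
        if len(cloud) < MIN_POINTS:
            continue                                   # would require excessive upsampling
        species_h = heights[labels == labels[i]]
        if abs(heights[i] - species_h.mean()) > std_fraction * species_h.std():
            continue                                   # height outlier for this species
        keep.append(i)
    return keep

def stratified_split(labels, test_size=0.1, seed=42):
    """90/10 stratified shuffle split preserving the class distribution."""
    splitter = StratifiedShuffleSplit(n_splits=1, test_size=test_size, random_state=seed)
    return next(splitter.split(np.zeros((len(labels), 1)), labels))
```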

2.2.1. Point Cloud Augmentation

To ensure a sufficiently large and balanced dataset for training the proposed model architecture, data augmentation techniques were applied to increase the number of individual tree point cloud samples. Augmentation was guided by the class distribution of the original dataset to reduce class imbalance at the initial training stage, resulting in the augmented sample counts reported in Table 3. The augmentation pipeline included random rotations (1°–359°), isotropic scaling within a range of 1 ± 20 % , random shuffling of point order, per-point jittering with a scale of 1 ± 0.045 , random mirroring along the X-Z and Y-Z planes, and the addition of Gaussian noise with a standard deviation of 0.005. This early augmentation rendered any further augmentation of additional data obsolete, as any other type of data used as an input for our proposed model architecture was generated using the augmented LiDAR and FWF point clouds.
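A compact NumPy sketch of one augmentation pass is shown below; the rotation axis (vertical) and the multiplicative interpretation of the 1 ± 0.045 jitter are our assumptions, as they are not spelled out above:

```python
import numpy as np

def augment_tree(points: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Apply one random augmentation pass to an (N, 3) tree point cloud."""
    pts = points.copy()
    # random rotation about the vertical (z) axis, 1°-359° (axis is an assumption)
    theta = np.deg2rad(rng.uniform(1.0, 359.0))
    rot = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                    [np.sin(theta),  np.cos(theta), 0.0],
                    [0.0,            0.0,           1.0]])
    pts = pts @ rot.T
    # isotropic scaling within 1 ± 20%
    pts *= rng.uniform(0.8, 1.2)
    # random shuffling of point order
    rng.shuffle(pts, axis=0)
    # per-point jittering with a scale of 1 ± 0.045 (interpreted multiplicatively)
    pts *= rng.uniform(1.0 - 0.045, 1.0 + 0.045, size=(pts.shape[0], 1))
    # random mirroring along the X-Z and Y-Z planes
    if rng.random() < 0.5:
        pts[:, 1] *= -1.0   # mirror across the X-Z plane
    if rng.random() < 0.5:
        pts[:, 0] *= -1.0   # mirror across the Y-Z plane
    # additive Gaussian noise with standard deviation 0.005
    pts += rng.normal(0.0, 0.005, size=pts.shape)
    return pts

# rng = np.random.default_rng(0); augmented = augment_tree(cloud, rng)
```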

2.2.2. Generation of Bidirectional Depth Images

As 2D-Computer Vision (CV) models have the potential to enhance the classification performance of 3D-DL algorithms, color-coded bidirectional depth images were generated for each individual tree point cloud to capture large-scale spatial relationships. To this end, each LiDAR tree point cloud was first voxelized using a fixed voxel size of 0.04, referenced to the original scale of the point cloud. To ensure a consistent representation across the dataset, a normalization step was introduced based on the spatial extent of each point cloud. Specifically, a scale factor S was computed to normalize the point cloud size relative to the largest bounding box dimension observed across the entire dataset. The bounding box dimension of an individual point cloud was defined as
$$\max\left(x_{\mathrm{bbox}},\, y_{\mathrm{bbox}},\, z_{\mathrm{bbox}}\right)$$
and the global maximum bounding box dimension across all $n$ samples was computed as
$$\max_{i=1,\dots,n} \max\left(x_{\mathrm{bbox}}^{(i)},\, y_{\mathrm{bbox}}^{(i)},\, z_{\mathrm{bbox}}^{(i)}\right)$$
The resulting scale factor $S$ was then defined as follows:
$$S = \frac{\max\left(x_{\mathrm{bbox}},\, y_{\mathrm{bbox}},\, z_{\mathrm{bbox}}\right)}{\max_{i=1,\dots,n} \max\left(x_{\mathrm{bbox}}^{(i)},\, y_{\mathrm{bbox}}^{(i)},\, z_{\mathrm{bbox}}^{(i)}\right)}$$
This normalization ensured the accurate representation of varying tree sizes. Each scaled voxel grid was then projected onto two empty arrays of 224 × 224 pixels: one after a 90° rotation around the z-axis (frontal view), and another after a 90° rotation around the x-axis (top-down view).
Combined with the correct z-axis scale according to the previously determined scale factor S, this procedure ensured that relevant height information within the different tree species was translated into the image projections. Additionally, for each pixel, the corresponding number of voxels in the third dimension was determined and written into the image array. The resulting image arrays were normalized to values between 0 and 255, allowing for a color-coded, bidirectional image representation of each individual point cloud as shown in Figure 3.
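The projection step can be illustrated with the following simplified NumPy sketch, which voxelizes one tree, scales it by S relative to the dataset-wide maximum bounding box dimension, and accumulates per-pixel voxel counts into two 224 × 224 images; the axis conventions and the count-based color coding are assumptions for illustration only:

```python
import numpy as np

IMG_SIZE = 224
VOXEL_SIZE = 0.04

def depth_images(points: np.ndarray, global_max_dim: float):
    """Voxelize one tree and project it into two 224 x 224 count images
    (frontal and top-down), scaled by S relative to the dataset maximum."""
    mins = points.min(axis=0)
    extent = points.max(axis=0) - mins
    scale = extent.max() / global_max_dim                      # scale factor S
    voxels = np.unique(np.floor((points - mins) / VOXEL_SIZE).astype(int), axis=0)

    # map voxel indices into the image grid, preserving relative tree size via S
    grid = (voxels / max(voxels.max(), 1) * (IMG_SIZE - 1) * scale).astype(int)

    frontal = np.zeros((IMG_SIZE, IMG_SIZE), dtype=np.float32)  # x-z plane, counts along y
    topdown = np.zeros((IMG_SIZE, IMG_SIZE), dtype=np.float32)  # x-y plane, counts along z
    for x, y, z in grid:
        frontal[IMG_SIZE - 1 - z, x] += 1.0
        topdown[y, x] += 1.0

    def to_uint8(img):
        # normalize occupancy counts to an 8-bit, color-codable depth image
        return (255 * img / img.max()).astype(np.uint8) if img.max() > 0 else img.astype(np.uint8)

    return to_uint8(frontal), to_uint8(topdown)
```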

2.2.3. Extraction of Numerical Features

As feature engineering has been shown to enhance classification performance, we applied a comprehensive set of feature engineering techniques to extract numerical descriptors from the individual tree point clouds. The selected numerical descriptors were informed by previous studies, including Lin and Hyyppä [16], Li et al. [36], Michałowska and Rapiński [37], Guo et al. [38], Hovi et al. [39], Shi et al. [40], and Shi et al. [41] (see Table 4 and Table 5 for an overview of the selected descriptors). In addition to the color-coded bidirectional image representations, these engineered features were derived from the previously segmented individual ALS, ULS, and FWF tree point clouds.
Depending on the subset of data to be processed (compare Table 3), both the FWF point cloud and the regular point cloud, or just the regular point cloud, were processed. For subsets where the FWF data were used, 58 different numerical features were calculated; for subsets without FWF data, only 50 different numerical features were created.

2.2.4. Feature Selection

Finally, features with low contributions to the classification task were eliminated using a Random Forest (RF) classifier [42]. The RF classifier was trained on the complete feature set, inherently yielding an importance score for each individual feature [42]. Features with an importance score of less than 0.05 were eliminated from the dataset, resulting in a total of 50 features for the data subsets with FWF data and 42 features for the data subsets without FWF data, respectively. The eliminated features across the subsets of data with FWF data available were Leaf Inclination, Leaf Curvature, Crown Asymmetry, Canopy Cover Fraction, Crown Symmetry, Canopy Ellipticity, Height Variation Coefficient, Branch Density, Echo Width, Principal Component (PCA) Eigenvalue 2 ($\lambda_{2/3}$), Crown Curvature, and the Gini Coefficient for Height Distribution.
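A minimal scikit-learn sketch of this importance-based selection is given below; note that scikit-learn's impurity-based importances sum to one, so the exact scaling of the 0.05 threshold used above is an assumption on our part:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def select_features(X: np.ndarray, y: np.ndarray, names: list, threshold: float = 0.05):
    """Train an RF on the full feature set and keep features whose importance
    score reaches the threshold."""
    rf = RandomForestClassifier(n_estimators=500, random_state=42, n_jobs=-1)
    rf.fit(X, y)
    importances = rf.feature_importances_      # impurity-based, sums to 1.0
    keep = importances >= threshold            # the paper's score may be scaled differently
    kept_names = [n for n, k in zip(names, keep) if k]
    return X[:, keep], kept_names

# X_selected, kept_names = select_features(X_train, y_train, feature_names)
```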
In a final preprocessing step, the point clouds used to generate the color-coded, bidirectional images and numerical features were resampled to match the input shape of the network architecture proposed in this paper. For this purpose, the commonly employed Non-Uniform Grid and Farthest Point Sampling (NGFPS) algorithm [18] was used to downsample any point cloud with more than 2048 points to exactly 2048 points. The resulting dataset was split with a ratio of 70% training data and 30% validation data using a stratified shuffle split.
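Since NGFPS combines a non-uniform grid step with farthest point sampling, the sketch below illustrates only the farthest-point-sampling half used to reach exactly 2048 points; the grid pre-partitioning is omitted for brevity:

```python
import numpy as np

def farthest_point_sampling(points: np.ndarray, n_samples: int = 2048) -> np.ndarray:
    """Greedy farthest point sampling: iteratively add the point that is farthest
    from the already selected set."""
    n = points.shape[0]
    if n <= n_samples:
        return points
    selected = np.zeros(n_samples, dtype=int)
    selected[0] = 0                                  # seed point (could also be random)
    dist = np.full(n, np.inf)
    for i in range(1, n_samples):
        diff = points - points[selected[i - 1]]
        dist = np.minimum(dist, np.einsum('ij,ij->i', diff, diff))  # squared distances
        selected[i] = int(np.argmax(dist))
    return points[selected]
```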

2.3. MMTSCNet Architecture

The proposed MMTSCNet relies on parallelized feature extraction and multimodal 3D classification of tree species (Figure 4). While many current state-of-the-art model architectures exclusively rely on engineered features [13,21] or merely act on point cloud data [11,28], MMTSCNet was developed to bridge the gap between both approaches and allow for the processing of various data types in their native format, thus preventing information loss and ultimately achieving similar or even better classification results. It consists of four individual branches, each responsible for processing a different type of data. The features extracted during this process are finally concatenated, weighted using the DMS module, and classified by a classification head. In the following subsections, we describe the individual branches and layers in more detail.

2.3.1. Point Cloud Extractor Branch

The first branch of our proposed architecture is responsible for extracting features from the resampled point clouds. To achieve this, an approach loosely based on the PointNet [17] and PointNet++ [18] architectures was chosen. While PointNet++ has proven to be more accurate in its predictions than PointNet [18], the added complexity increases the risk of overfitting and imposes higher computational demands, necessitating more advanced hardware for effective training. Since the proposed MMTSCNet architecture is composed of multiple individual architectures, less complex but still potent architectures were chosen for the individual branches in an effort to reduce the computational effort needed to tune and train the resulting model.
As highlighted in Figure 4, the Point Cloud Extractor (PCE) first applies Multi-Scale Grouping (MSG) to the input point cloud of 2048 points, each of which contains 3 dimensions (x, y, z). A residual is passed through a single Conv1D layer while the input passes through a first convolution block, before the extracted features and the residual are added back together. All convolution blocks in the PCE use LayerNormalization, which allows small batches of data to be processed without instability, and a Swish activation function, which prevents neurons from deactivating. Two more convolution blocks follow, extracting further features before a HybridPool layer is applied. The HybridPool layer combines a GlobalMaxPooling1D layer and a GlobalAveragePooling1D layer and is employed to retain both large-scale and fine-grained patterns in the extracted features through a final concatenation step.
The following illustrative example demonstrates the internal operations of the PCE for a single input sample, which consists of a point cloud with shape (2048, 3), representing 2048 points in three-dimensional space. For MSG, each point is grouped with its 24 nearest neighbors within the radii [0.055, 0.135, 0.345, 0.525, 0.695], as determined during the hyperparameter tuning (see Table 6). After applying a Conv1D layer with 256 filters to the grouped local regions, each scale produces a tensor of shape (2048, 256). The resulting five tensors are concatenated into a (2048, 1280) tensor. A residual pathway with a separate Conv1D layer outputs a tensor of shape (2048, 256), which is added element-wise to the output of the first convolution block. Two additional convolution blocks (as shown in Figure 4) reduce the combined features to (2048, 512). Finally, the HybridPool layer yields a global feature vector, combining both GlobalMaxPooling1D and GlobalAveragePooling1D pooled features. This salient feature vector of shape (1, 1024) is passed on to the DMS.
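A condensed Keras sketch of this branch is given below. The multi-scale grouping is simplified to parallel per-point convolutions (the published model groups each point with its 24 nearest neighbors at the five radii before convolving), so the sketch reproduces the block structure and tensor shapes rather than the exact neighborhood aggregation:

```python
import tensorflow as tf
from tensorflow.keras import layers

def conv_block(x, filters):
    """Conv1D -> LayerNormalization -> Swish, as used throughout the PCE."""
    x = layers.Conv1D(filters, 1, padding="same")(x)
    x = layers.LayerNormalization()(x)
    return layers.Activation("swish")(x)

def build_pce(num_points=2048, num_scales=5, filters=256):
    inp = tf.keras.Input(shape=(num_points, 3))
    # Stand-in for MSG: one feature map per radius/scale, concatenated to (2048, 1280)
    scales = [conv_block(inp, filters) for _ in range(num_scales)]
    x = layers.Concatenate()(scales)
    x = conv_block(x, filters)                       # first convolution block -> (2048, 256)
    residual = layers.Conv1D(filters, 1)(inp)        # residual pathway -> (2048, 256)
    x = layers.Add()([x, residual])
    x = conv_block(x, 512)
    x = conv_block(x, 512)                           # (2048, 512)
    # HybridPool: concatenate global max and global average pooling -> (1024,)
    pooled = layers.Concatenate()([layers.GlobalMaxPooling1D()(x),
                                   layers.GlobalAveragePooling1D()(x)])
    return tf.keras.Model(inp, pooled, name="pce_sketch")
```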

2.3.2. 2D Feature Extraction Branches

The second and third branches of the model are two image processing pipelines that process the bidirectional, color-coded depth images of each point cloud: the FIP branch for the frontal-view images and the TDIP branch for the top-down-view images. Both image processing branches are identical and composed of pretrained instances (ImageNet [43]) of the EfficientNetV2S architecture [19]. The EfficientNetV2S architecture was selected for the image processing branches as it surpassed other popular architectures such as DenseNet [44] on the ImageNet benchmark dataset when it was published [19] while also having only a fraction of the trainable parameters of other state-of-the-art models, thus requiring fewer computational resources to train. Twenty layers of the EfficientNetV2S instances were set up to be trainable, while the remaining layers stayed frozen. The extracted features are combined in a single feature vector and passed on to the classification head.
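A sketch of one such branch, assuming the depth images are fed as 224 × 224 × 3 tensors and that the two branches do not share weights (both assumptions on our part):

```python
import tensorflow as tf

def build_image_branch(name: str, trainable_layers: int = 20):
    """ImageNet-pretrained EfficientNetV2S with only the last 20 layers trainable."""
    backbone = tf.keras.applications.EfficientNetV2S(
        include_top=False, weights="imagenet", input_shape=(224, 224, 3), pooling="avg")
    for layer in backbone.layers[:-trainable_layers]:
        layer.trainable = False                      # keep early layers frozen
    inp = tf.keras.Input(shape=(224, 224, 3), name=f"{name}_input")
    features = backbone(inp)                         # pooled (1280,) feature vector
    return tf.keras.Model(inp, features, name=name)

# One instance per view, e.g. for the frontal and top-down depth images:
fip_branch = build_image_branch("fip_branch")
tdip_branch = build_image_branch("tdip_branch")
```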

2.3.3. Numerical Feature Extraction Branch

The final branch of our proposed model architecture MMTSCNet is responsible for the extraction of features from the previously generated numerical features. The Metrics Model (MM) is loosely based on a Multi-Layer Perceptron (MLP) structure, with the major difference of residual connections between each of the fully connected dense blocks. It is composed of four dense blocks, each featuring a Dense layer followed by LayerNormalization. Analogous to the PCE, LayerNormalization was employed in the MM to allow the processing of smaller batches of data. The subsequent layer in the dense blocks is an Activation layer using the Swish activation function to prevent the deactivation of neurons and to stabilize the gradients during training. The last layer of each dense block is a Dropout layer to prevent overfitting. Between each of the dense blocks, features are copied and passed on in parallel. They are then projected to the same shape as the features that pass through the dense block and are added back into the feature stream. These residual connections counteract the problem of vanishing gradients, allow for better feature reuse, and lead to faster convergence of the model [45].
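A minimal Keras sketch of one residual dense block and the four-block MM stack; the layer widths and dropout rate are placeholders, as the tuned values are only given in Table 6:

```python
import tensorflow as tf
from tensorflow.keras import layers

def dense_block(x, units, dropout_rate=0.3):
    """Dense -> LayerNormalization -> Swish -> Dropout, wrapped with a residual connection."""
    shortcut = layers.Dense(units)(x)    # project the bypassed features to the block's width
    y = layers.Dense(units)(x)
    y = layers.LayerNormalization()(y)
    y = layers.Activation("swish")(y)
    y = layers.Dropout(dropout_rate)(y)
    return layers.Add()([y, shortcut])

def build_metrics_model(num_features=50, units=256, num_blocks=4):
    inp = tf.keras.Input(shape=(num_features,))
    x = inp
    for _ in range(num_blocks):
        x = dense_block(x, units)
    return tf.keras.Model(inp, x, name="metrics_model_sketch")
```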

2.3.4. Classification Head

The classification head is the final element of MMTSCNet and consists of a series of dense blocks with residual connections (identical in structure to the dense blocks used in the MM) as well as an adaptive feature weighting mechanism (DMS) that dynamically adjusts the contribution of each modality to the classification task at hand. Prior to classification, the DMS allows MMTSCNet to actively learn and weigh the outputs of the individual branches in order to maximize the classification performance. While the DMS is active, weights for each of the branches are generated and applied to Dense projection layers prior to concatenation and further processing. The resulting weighted feature vector is subsequently passed on to a series of dense blocks and the final Dense layer with a Softmax activation function.
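The fusion logic can be sketched as follows: one learned, softmax-normalized weight per branch scales the projected branch features before concatenation and classification (the projection width, the stand-in for the residual dense blocks, and the exact form of the weight generator are assumptions):

```python
import tensorflow as tf
from tensorflow.keras import layers

def dms_classification_head(branch_outputs, proj_units=256, num_classes=7):
    """Dynamic Modality Scaling: learn one weight per branch, scale the projected
    branch features, concatenate, and classify with a softmax layer."""
    concat = layers.Concatenate()(branch_outputs)
    # One softmax-normalized weight per modality, derived from all branch features
    branch_weights = layers.Dense(len(branch_outputs), activation="softmax",
                                  name="dms_weights")(concat)
    weighted = []
    for i, feat in enumerate(branch_outputs):
        proj = layers.Dense(proj_units, activation="swish")(feat)   # Dense projection layer
        weighted.append(proj * branch_weights[:, i:i + 1])          # broadcast-scale by weight i
    x = layers.Concatenate()(weighted)
    x = layers.Dense(256, activation="swish")(x)     # stand-in for the residual dense blocks
    return layers.Dense(num_classes, activation="softmax")(x)

# Toy usage with four placeholder branch feature vectors (PCE, FIP, TDIP, MM):
branch_inputs = [tf.keras.Input(shape=(d,)) for d in (1024, 1280, 1280, 50)]
head = tf.keras.Model(branch_inputs, dms_classification_head(branch_inputs))
```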

2.4. Hyperparameter Tuning and Training

Hyperparameter tuning was performed on all subsets of data using the Keras-Tuner (v. 1.4.7) Python package [46]. The tuner of choice was the BayesianOptimization tuner with maximum validation accuracy as the objective, and the tuning was conducted over 10 trials with 15 epochs for each subset of data. Based on the hyperparameters determined during the tuning process (compare Table 6), MMTSCNet has 4 M trainable parameters and uses a learning rate of 0.00001. We also tested different dropout rates and L1/L2 regularization for each subset of data but opted for a dynamic approach, tuning these values independently for each dataset instead of selecting fixed values. We used a workstation equipped with an AMD Ryzen 9 9950X 16-core, 32-thread CPU (Advanced Micro Devices, Inc., Santa Clara, CA, USA), 128 GB of RAM (Corsair Gaming, Inc., Fremont, CA, USA) and an NVIDIA RTX 4090 with 24 GB of VRAM (NVIDIA, Santa Clara, CA, USA) for all experiments. The total energy consumption of MMTSCNet during tuning and training was estimated at around 12.6 kWh (18 h of tuning and training at approximately 700 W).
The training of MMTSCNet on all the available subsets of data (8 total) was conducted over a maximum of 150 epochs with a batch size of 8. MMTSCNet was trained using the focal loss [47] to address class imbalance in the dataset. During the training process, the initial learning rate is increased by a factor of 1.01 for the first five epochs if the learning rate is lower than 0.0001. If the learning rate is higher than 0.0001 during the first 5 epochs, it is increased by a factor of 1.002. The learning rate is reduced by a factor of 0.98 after epoch five until it reaches $5 \times 10^{-7}$. The initial increase in the learning rate was implemented to allow the model to escape poor initial weights and explore the parameter space more quickly during the first epochs of the training. The constant decrease in the learning rate after epoch 5 prevents an overshooting of the optimal parameters and stabilizes the learning process as it approaches convergence. In order to further prevent overfitting, a callback for early stopping was used, which halts the training progress and restores the best weights if the validation accuracy of the model has not improved over the last 12 epochs. While the number of epochs varied greatly for each subset of data, all configurations trained for 50 to 65 epochs before the early stopping callback was triggered.
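The learning rate schedule and early stopping described above translate into Keras callbacks roughly as follows; the optimizer choice and the focal loss implementation (CategoricalFocalCrossentropy is only available in recent Keras versions) are assumptions:

```python
import tensorflow as tf

def lr_schedule(epoch, lr):
    """Warm-up for the first five epochs, then decay by 0.98 down to a floor of 5e-7."""
    if epoch < 5:
        return lr * (1.01 if lr < 1e-4 else 1.002)
    return max(lr * 0.98, 5e-7)

callbacks = [
    tf.keras.callbacks.LearningRateScheduler(lr_schedule),
    tf.keras.callbacks.EarlyStopping(monitor="val_accuracy", patience=12,
                                     restore_best_weights=True),
]

# model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-5),
#               loss=tf.keras.losses.CategoricalFocalCrossentropy(),
#               metrics=["accuracy"])
# model.fit(train_ds, validation_data=val_ds, epochs=150, batch_size=8, callbacks=callbacks)
```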

2.5. Other Architectures for Evaluation

To rigorously assess the classification performance of the proposed MMTSCNet architecture, we benchmark it against two representative baseline models trained on the same dataset [30]: PointNet++ [20] and DSTCN [13]. These models were selected to represent two widely used but fundamentally distinct paradigms in tree species classification from LiDAR data. PointNet++ is a deep learning architecture designed for direct operation on unstructured point clouds, employing hierarchical feature extraction and MSG strategies to capture local and global geometric patterns. DSTCN, by contrast, relies on a feature-engineering pipeline that transforms raw point clouds into histogram-based descriptors, encoding intensity, height, and geometric statistics, which are then processed via deep convolutional layers.
Beyond PointNet++ and DSTCN, our evaluation also draws context from related uni- and multimodal studies such as those by Liu et al. [28], who combined ResNet-50 and PointMLP for RGB image and point cloud fusion; Reisi Gahrouei et al. [27], who employed patch-based CNNs and transformer models for multispectral and LiDAR fusion, and Ferreira et al. [29], who developed an early fusion approach combining RGB and NIR imagery with LiDAR-derived surface curvature. Both Allen et al. [21] and Fan et al. [15] proposed unimodal approaches that generate 2D projections of 3D point clouds, transforming structural information into image-like formats for convolutional processing. While these uni- and multimodal architectures achieve promising classification performance, they all rely on substantial data preprocessing or transformation, inherently resulting in the loss of modality-specific geometric and radiometric detail. The comparison to these approaches facilitates a comprehensive performance baseline across methodological categories against which the benefits of a native-format, dynamically fused multimodal model such as MMTSCNet can be critically assessed.
Our evaluation is structured across multiple data configurations, including ALS and ULS point clouds collected under both leaf-on and leaf-off conditions, as well as combinations that include FWF data when available. This enables us to test each model’s performance under varying levels of data richness, structure, and sensor modality and to assess the relative strengths of native-format multimodal fusion compared to unimodal and feature-transformed alternatives. The inclusion of FWF attributes is particularly relevant as few existing studies exploit FWF signals in deep learning workflows despite their potential to encode vertical structure and scattering properties.

2.6. Accuracy Assessment

To evaluate the performance of the models, confusion matrices and additional metrics such as the Kappa coefficient (equations are given in Table 7) were computed from predictions made on the corresponding, previously unseen test datasets. For each data subset, 10 predictions were collected for the corresponding shuffled test dataset, and the mean predicted values were used to construct the confusion matrices.
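A sketch of this evaluation procedure (shuffling of the test data between passes is omitted; the metrics follow the definitions in Table 7, here computed with scikit-learn):

```python
import numpy as np
from sklearn.metrics import confusion_matrix, cohen_kappa_score, f1_score

def evaluate(model, test_inputs, y_true, n_runs=10):
    """Average the predicted class probabilities over several passes on the test
    set, then derive the confusion matrix, OA, MAF, and Kappa coefficient."""
    probs = np.mean([model.predict(test_inputs, verbose=0) for _ in range(n_runs)], axis=0)
    y_pred = probs.argmax(axis=1)
    y_true = np.asarray(y_true)
    return (confusion_matrix(y_true, y_pred),
            {"OA": float((y_pred == y_true).mean()),
             "MAF": f1_score(y_true, y_pred, average="macro"),
             "Kappa": cohen_kappa_score(y_true, y_pred)})
```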

3. Results

Figure 5, Table 8, Table 9 and Table 10 summarize our results. All metrics are reported on test data.
There is little misclassification between most species, though some errors occurred between Carpinus betulus, Fagus sylvatica, and Picea abies (compare Figure 5). For all subsets of data, except for the ULS data with FWF data available under leaf-off conditions, misclassifications involving Quercus petraea, Carpinus betulus, Fagus sylvatica, and Picea abies occurred. Based on the normalized values for each species, it is evident that MMTSCNet was able to score exceptionally high accuracies for most species, except for Carpinus betulus, Fagus sylvatica, and Quercus petraea. For these species, some misclassification is evident across all subsets of data.
With respect to different input data, classification performance on the combination of ALS data with FWF data scored highest, achieving an OA of nearly 97%, an MAF of nearly 97%, and a Kappa Coefficient of 0.96. Results for the other data subsets show that the ALS data in combination with or even without FWF data produced the best results, closely followed by the ULS data. Additionally, it is evident that the inclusion of FWF data provided significant benefits to the classification performance, resulting in an increase in the MAF of +4.66% for the ALS data, +4.69% for the ULS data under leaf-on conditions, and +2.59% for the ULS data under leaf-off conditions.
Across all subsets of data, MMTSCNet produced the best results for Pseudotsuga menziesii and Quercus rubra, as indicated by the high F1-score, Precision, and Recall values (compare Table 9). With an F1-score of 99% for the ALS data with FWF data present, 96% for the ULS data under leaf-on conditions, and 100% for the ULS data with FWF data present under leaf-on conditions, scores for Quercus rubra are the highest in almost all experiments. The only exception occurs for the ALS data without FWF data present, where MMTSCNet performed better on samples of Picea abies by a small margin. For the ULS data under leaf-off conditions, Pseudotsuga menziesii was classified with the highest F1-score of 98% when no FWF data were present, and for the remaining subset of ULS data under leaf-off conditions where FWF data were present, an F1-score of 100% was produced for Quercus petraea. Judging by these species-specific results, a pattern where coniferous tree species were classified with less accuracy than broad-leaved tree species can be observed.
Table 10 also shows results for PointNet++ with various sampling strategies and DSTCN, which were trained on the same dataset by Liu et al. [20] and Zhang et al. [13], respectively. Overall, MMTSCNet managed to outperform both models across all subsets of data, except for the ALS data with no FWF data present.
For the ALS data, MMTSCNet achieved an increase in OA of +12.79% in comparison to the best PointNet++ instance with K-Means sampling trained by Zhang et al. [13] when FWF data were included. Without the presence of FWF data, MMTSCNet still achieved an increase in OA of +8.14% compared to PointNet++ (KS). In a direct comparison with DSTCN, our model was able to achieve an increase in OA of +3.19% when FWF data were used but fell short by −1.08% without FWF data present.
MMTSCNet was also able to achieve classification performance increases for the ULS data subsets: For the ULS data under leaf-on conditions, MMTSCNet was able to achieve an increase in OA of +2.22% without FWF data present when compared to PointNet++ with Non-uniform Grid and Farthest Point Sampling (NGFPS) by Liu et al. [20]. When FWF data were present, the increase in OA amounted to +6.67%. In addition to these significant accuracy improvements, MMTSCNet was also used to classify seven species, while Liu et al. [20] only classified four species with their instance of PointNet++ with NGFPS.
The classification results for ULS data under leaf-off conditions also show an increase in OA by +4.49% with no FWF data present. With FWF data, the increase in OA produced by MMTSCNet when compared to PointNet++ (NGFPS) amounted to +6.74%. For this subset of data, MMTSCNet was used to classify five different species, while Liu et al. [20] trained PointNet++ with NGFPS on four species.
Overall, MMTSCNet was able to achieve significant increases in classification performance when FWF data were used. For the data subsets without FWF data, our proposed model produced lower, yet also significant, increases in classification accuracy in comparison to the PointNet++ architecture, while falling short by a small margin when compared to DSTCN.

Ablation Study

To evaluate the impact of each modality on the classification performance and the effectiveness of the proposed DMS module, we conducted a systematic ablation study. For this purpose, we opted to perform two main ablation tests: Branch Removal and Branch Weight Assessment, as well as DMS Removal.
The Branch Removal and Branch Weight Assessment test involved re-training various instances of MMTSCNet with individual branches disabled in order to quantify the classification performance difference when a reduced number of modalities was available. This test was conducted on the ALS dataset with FWF data available as this combination produced the best results during our study. Additionally, the architectural hyperparameters shown in Table 6 were used while conducting both ablation tests. Table 11 showcases the results produced by the MMTSCNet instances with disabled input modalities. To allow for a direct comparison to the baseline classification performance on the combination of ALS and FWF data, confusion matrices (see Figure 6) were generated by averaging the results of ten prediction cycles over the entire test dataset. The performance metrics shown in Table 10 were calculated for each model instance. Additionally, we logged the branch weights learned by the DMS for each model instance after training the instance in order to obtain an insight into the absolute changes in the branch weights during this ablation test. A combination of these weights and the baseline weights generated during the training of MMTSCNet with all four branches on the ALS and FWF data combination is shown in Figure 7.
The second ablation test, the DMS Removal, focused on re-training MMTSCNet on the aforementioned combination of ALS and FWF data while the DMS was disabled. This resulted in equal weights and thus equal contributions to the classification for each individual branch (25%), which allowed us to assess the impact of the DMS on the overall classification performance. The results of this ablation test are also showcased in Table 11.
With the ablation study, we systematically evaluated the impact of individual modality branches and the DMS on the overall classification performance of MMTSCNet. The full model, encompassing all five components—DMS, PCE, MM, FIP, and TDIP—achieved the highest performance across all evaluated metrics, with an OA of 0.97, a MAF of 0.97, a MAP of 0.97, a MAR of 0.97 and a Kappa coefficient of 0.96 as showcased in Table 10 and Table 11.
With the DMS module disabled while retaining all other branches, performance declined notably, with the OA dropping to 0.82 and the Kappa coefficient decreasing to 0.61. Without the DMS, modality contributions were fixed at 25% each, eliminating adaptive weighting and thus limiting the model’s performance. Removing specific branches while retaining the DMS led to varying degrees of performance reduction. The absence of the TDIP branch resulted in an OA of 0.88 while disabling the FIP branch resulted in a slightly higher OA of 0.90. For the MMTSCNet with both the FIP and the TDIP disabled, OA remained at 0.90, suggesting that point clouds and numeric features were able to partially compensate for the missing modalities.
More severe degradation was observed when multiple branches were simultaneously disabled. MMTSCNet with only the PCE branch exhibited a sharp performance decline, with the OA degrading to 0.48 and a Kappa coefficient of only 0.08. When the PCE and TDIP branches were enabled, MMTSCNet achieved an OA of 0.63 and a Kappa coefficient of 0.22. Removing the PCE while keeping the FIP, TDIP, and MM branches resulted in an OA of 0.82, indicating that point cloud features were not necessary for maintaining high classification performance if alternative data representations were available.
The confusion matrices revealed that MMTSCNet with all branches and the DMS module produced highly accurate classifications, with most class predictions closely following the diagonal. When modalities were disabled, an increase in misclassifications was observed, particularly in classes with similar morphology or height distribution. The removal of the DMS module led to more evenly distributed misclassification patterns, suggesting a loss of adaptive decision-making.
The analysis of attention weights in Figure 7 shows that the inclusion of the MM branch led to extremely small weights for the other branches, highlighting the importance of engineered numerical features to accurately differentiate between tree species. Additionally, it is evident that with the MM branch disabled, the weight assigned to the FIP branch is marginally larger than the weight assigned to the TDIP branch. Across all tests, the weights assigned to the PCE branch are the smallest. When the DMS module was disabled, all branches contributed equally, which correlated with a notable decline in classification performance.

4. Discussion

Compared to other common approaches to automated tree species classification, MMTSCNet uses a multimodal approach, combining different types of 3D and 2D input data to achieve high classification accuracy and quality while dynamically adjusting the contribution of each processing branch to achieve optimal classification results. This strategy has not often been employed for tree species classification tasks.

4.1. Discussion of Our Results

In a direct comparison to the results of several PointNet++ instances with various sampling strategies trained by Zhang et al. [13] and Liu et al. [20] on ALS data, MMTSCNet achieved a significantly higher OA, with an increase of up to 12.8% for the same seven tree species. The main reason for this significant improvement is probably the additional information due to feature engineering. Since FWF data are available for all twelve forest plots in the study area and were provided as a byproduct of the ALS campaign, MMTSCNet was developed with these data in mind. However, MMTSCNet's results clearly indicate that while the FWF data drastically improve the classification performance, similar or significantly better results than other state-of-the-art model architectures are also achieved without this additional information derived from FWF data.
MMTSCNet was able to surpass PointNet++ with NGFPS with an average increase in OA of 5% while also classifying seven instead of just four different tree species for ULS data under leaf-on conditions and five instead of four species for ULS data under leaf-off conditions. This significant improvement is achieved by our multimodal approach, which allows for the extraction of more information and thus detail from the point clouds. The addition of numerical features also adds weight to the most important point-to-point relations since these can be captured by both the PCE and the numerical features. While the numerical features assist in capturing details and weighting of significant features, the generated bidirectional depth images contribute to the description of the morphology of a tree point cloud. Furthermore, the DMS module in the classification head of our proposed architecture allows MMTSCNet to leverage self-attention to dynamically learn and adjust modality importance, resulting in improved classification performance.
When the available FWF data were not used, MMTSCNet was not able to surpass the OA of 94% achieved by DSTCN [13] on the ALS data subset, falling short by 1.08% at 93% OA; however, it showed a significant improvement in OA of up to 3.19% when the numerical features generated from the FWF data were utilized. This highlights the benefits of FWF data for tree species classification and the importance of feature engineering. For all instances, the addition of the FWF features improved the classification performance with minor misclassification. The main limiting factor was identified to be the total number of numerical FWF features. While the selected FWF features make up only 7.5% of all available numerical features, more numerical features should be derived from the FWF data to further increase the classification performance. As an additional source of spectral information, multispectral imagery should be used to delineate further distinct features, as it has been proven to improve classification results [11].
Regarding the shortcomings of MMTSCNet on ALS data without numerical FWF features, several possible reasons were identified, including the sparsity of ALS point clouds, which lack information about the lower canopy and stem of individual trees due to varying parameters during the data acquisition. The density of ALS point clouds is mainly influenced by the amount of overlap between individual flight strips, the altitude at which the sensor is deployed, as well as the airborne speed of the aircraft. With a relatively low spatial resolution of ≈72.5 points per m², the ALS point clouds do not provide a sufficient number of points to generate feature-rich depth images, which is visible in Figure 8. The generated numerical features are also prone to erroneous calculations based on a low spatial resolution, as the geometric numerical features can be especially inaccurate if the point density is extremely sparse below the canopy. In addition to that, the ALS data were acquired when the trees were in a leaf-on state, which adds inherent noise from swinging leaves and branches. This contributes to possible geometric inaccuracy and hinders the penetration capabilities of laser beams, as multiple layers of leaves have to be penetrated to reach the stem or even the ground. While these sources of error have a direct impact on the classification performance of MMTSCNet, Zhang et al. [13] conducted manual segmentation, visual inspection, and selection of point clouds in order to select optimal training samples, while MMTSCNet was trained on every available point cloud with a sufficient number of points. This indicates that the results by MMTSCNet are lower due to inconsistencies in data quality, which were eliminated for PointNet++ with NGFPS and DSTCN by Zhang et al. [13].
The difference in performance between ALS and ULS data can also be attributed to the spatial resolution of the source data. Compared to the ALS data, the ULS data have a spatial resolution of ≈916 points per m², which is approximately 840 points per m² more than the ALS data. This is a direct result of the high overlap between flight strips, the sensor altitude, the velocity of the UAVs during the data acquisition, and the sensor characteristics themselves. The resulting point clouds have a significantly higher point density, especially below the canopy, as the lower flight altitude and velocity, as well as the higher overlap of flight lines, allow for a deeper penetration of the canopy. A higher point density ultimately results in more detail being featured in the depth images, as well as a more accurate calculation of numerical features, especially those involving the stem of the tree. However, during the downsampling of the point clouds, important structural information is lost, which has a direct impact on classification performance. This results in an imbalance in the information content of the different input modalities, which may have a less significant impact on the lower-resolution ALS data. While the DMS was developed for this case, the low number of training epochs limited the DMS's ability to weight features appropriately.
The results of our ablation study presented in Section 3 demonstrate the necessity of both multimodal fusion and dynamic modality weighting for achieving optimal classification performance. The significant decline in performance observed when disabling the DMS module suggests that fixed, equal weighting of all modalities is suboptimal. The performance trends observed after disabling individual branches provide insight into the relative contributions of each modality. The TDIP and FIP branches play a crucial role, as their removal resulted in moderate declines in classification accuracy, indicating their importance in feature extraction. However, MMTSCNet was still able to maintain reasonable classification performance without one of these branches, suggesting redundancy in some extracted features. The relatively smaller performance drop when disabling only one of these branches suggests that they provide complementary but overlapping information.
The performance degradation seen when multiple branches were disabled highlights the importance of a multimodal approach. The drastic drop in classification accuracy when only the PCE was retained suggests that point cloud data alone are insufficient for robust classification, likely due to the limitations of 3D point features in capturing certain discriminative properties and the selected backbone architecture. Conversely, the relatively strong performance of a model without the PCE branch suggests that, in certain contexts, the features derived from frontal and top-down images, along with enhanced metrics, can sufficiently compensate for the absence of point cloud data. These findings indicate that the performance of the PCE branch alone is suboptimal and can potentially be improved by changing the underlying architecture, e.g., by exchanging the PCE with an instance of PointNet++ with NGFPS, which has produced good overall results on the dataset in the study by Zhang et al. [13]. Although such changes can have a negative impact on the tuning and training time of MMTSCNet and introduce overfitting on small datasets, this may be appropriate for larger datasets such as the FORspecies20k benchmark dataset by Puliti et al. [48].
The confusion matrices in Figure 6 further reinforce these findings, demonstrating that the full model is highly effective in distinguishing between classes, while models with missing modalities exhibit increased confusion, particularly among species with morphological similarities. The absence of the DMS module, in particular, led to a more uniform distribution of misclassifications, confirming that adaptive weighting plays a crucial role in optimizing classification confidence. The attention weight analysis (see Figure 7) provides additional support for these conclusions. The learned weight distribution in the full model indicates that the DMS module effectively assigns higher weights to the most informative modalities. The loss of this adaptive weighting mechanism in the DMS-disabled model results in suboptimal contributions from each modality, leading to decreased classification performance.
Overall, the findings strongly support the adoption of a model-level and dynamically weighted multimodal approach for tree species classification tasks in complex datasets. The ability to selectively emphasize the most relevant features through the DMS module, combined with the complementary nature of different data modalities, is critical for achieving high classification accuracy.

4.2. Comparison to Other Approaches

After this direct comparison to PointNet++ and DSTCN, we now discuss other state-of-the-art approaches that were conducted on different datasets and data types. Ferreira et al. [29] achieved an F1-score of 73.7% using two ResUNet encoder-decoders on fused RGBNIR and ALS features for the classification of six tree species. While this highlights the potential of modality-fusion and feature-engineering, our approach was able to achieve a higher F1-score of ≈92% on ALS data for seven tree species, indicating a significant improvement in classification performance (+18.3%). This improvement can be attributed to the advanced feature extraction from LiDAR point clouds, bidirectional depth images, engineered features, and the usage of a multimodal self-attention mechanism in our proposed architecture. The similar, DenseNet-based approach by Reisi Gahrouei et al. [27], which also used a feature-level fusion of ALS data and RGBNIR imagery, achieved a 78% OA for nine tree species. While this indicates a selection of potent features for the classification of tree species, the significantly higher performance of MMTSCNet on a similar number of tree species (+15% OA) highlights the effectiveness of a multimodal model architecture. This suggests that leveraging engineered features in combination with point cloud-based structural 2D and 3D information allows for a more comprehensive representation of tree morphology, ultimately enhancing classification accuracy and robustness.
Compared to the approach by Allen et al. [21], who classified TLS tree point clouds into five species by generating six perspective projection images per point cloud using a stack of six ResNet18 models, our approach achieved an increase in the OA by +12.4% from 80.5% OA to 93% OA. This further demonstrates the superiority of direct point cloud processing over unimodal 2D projection-based methods. While the proposed perspective projections capture some morphological properties, they fail to preserve fine-grained 3D structural information essential for tree species classification. Fan et al. [15] proposed a similar method, converting ALS point clouds into 2D grayscale depth images and 2D bidirectional, color-coded projection images. The authors tested multiple unimodal 2D architectures such as ResNet50, DenseNet-121, and VGG-19 on several subsets of data, with ResNet50 achieving the highest MAF of 91% for nine tree species using the bidirectional, color-coded projection images. While this method demonstrates the importance of depth representation, MMTSCNet was able to surpass the achieved MAF by +5.88%, leveraging numerical features derived from FWF and LiDAR data, which aid in distinguishing species with similar depth profiles.
Liu et al. [28] proposed a multimodal model architecture (TSCMDL) combining PointMLP for point cloud feature extraction and ResNet50 for image feature extraction. While TSCMDL achieved an F1-score of 98.5%, it was trained on a highly controlled dataset containing only two tree species. In contrast, MMTSCNet was tested on seven tree species and achieved an F1-score of 97%, demonstrating greater generalization capability across diverse tree species. This suggests that, while TSCMDL performed slightly better in a highly restricted setting, MMTSCNet is superior for more complex, multi-species classification tasks.
The superior classification performance of MMTSCNet is enabled by its advanced feature representation capabilities, including dynamic modality weighting via the DMS module. However, this architectural complexity results in approximately 4 million trainable parameters, slightly more than twice the parameter count of PointNet++ (typically ≈1.6–1.8 million) with MSG and three Set Abstraction layers. While PointNet++ offers a lightweight baseline with competitive performance on raw point clouds, its unimodal nature limits its ability to leverage complementary structural and radiometric information, especially in classification tasks with a large number of tree species. Despite the increased parameter count, MMTSCNet remains computationally feasible for real-world deployment. The use of parallel processing branches and dynamic fusion allows for efficient inference at the batch level, with scalability for large-scale airborne surveys. While inference time is naturally higher than that of PointNet++, the trade-off is justified by the significantly improved classification accuracy, particularly under heterogeneous forest conditions and variable sensor inputs.
This balance between model complexity and predictive performance is critical for operational forestry applications, where species diversity, seasonal variation, and sensor variability require adaptable and robust classification systems. While it is not suitable for real-time applications, MMTSCNet offers a compelling solution for national forest inventories, biodiversity monitoring, and ecological modeling where accuracy and interpretability are prioritized over minimal computational cost.
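As a hedged illustration of the batch-level, multi-input inference described above, the sketch below prepares one batch of the four modalities and passes it to an assumed compiled tf.keras.Model with four named inputs; the input names, shapes, and batch size are hypothetical and do not correspond to the exact MMTSCNet interface.

```python
import numpy as np

# Hypothetical batch of preprocessed inputs for a four-branch multimodal model.
# Shapes are illustrative assumptions (2048 points, 224x224 depth images, 45 features).
batch = {
    "points": np.zeros((32, 2048, 3), dtype=np.float32),
    "frontal_img": np.zeros((32, 224, 224, 3), dtype=np.float32),
    "topdown_img": np.zeros((32, 224, 224, 3), dtype=np.float32),
    "features": np.zeros((32, 45), dtype=np.float32),
}

# Assuming `mmtscnet` is a compiled tf.keras.Model with matching input names:
# probabilities = mmtscnet.predict(batch, batch_size=32)
# print("Trainable parameters:", mmtscnet.count_params())
```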

5. Conclusions and Outlook

We propose a new architecture for the automated classification of tree species, MMTSCNet, which features a multimodal and highly modular approach. Point clouds, bidirectional, color-coded depth images, and numerical features are processed in parallel, leading to exceptional results on ALS and ULS data for up to seven different tree species. Across all available subsets of data, the addition of numerical features derived from FWF data significantly increased the classification performance, highlighting the benefits of combining different data sources. With OAs of nearly 97% for ALS data and over 95% for ULS data (both including FWF-derived features), and approximately 4 million trainable parameters, MMTSCNet is well suited for real-world applications.
Our results demonstrate that MMTSCNet achieves exceptional performance through extensive feature engineering and modality fusion while maintaining strong generalization capability across up to seven tree species. Future work should further improve the model by incorporating additional modalities and evaluating their contribution, extending the feature set derived from FWF data, and, most importantly, replacing the individual processing branches with superior model architectures and evaluating these on a similar dataset. Furthermore, the DMS module should be refined to enhance its adaptive capabilities, potentially improving performance in scenarios with limited or noisy modality inputs. Additionally, MMTSCNet should be trained and benchmarked on larger datasets such as the FORspecies20k dataset [48] and compared to other model architectures to provide insights into the transferability and scalability of the achieved results. Our code is available at https://github.com/jvahrenhold97/MMTSCNET (accessed on 30 March 2025), and we encourage other researchers to improve upon MMTSCNet and the DMS module.

Author Contributions

Conceptualization, J.R.V., M.B. and M.S.M.; Data curation, J.R.V.; Formal analysis, J.R.V.; Investigation, J.R.V.; Methodology, J.R.V.; Project administration, J.R.V., M.B. and M.S.M.; Resources, J.R.V., M.B. and M.S.M.; Software, J.R.V.; Supervision, M.B. and M.S.M.; Validation, J.R.V.; Visualization, J.R.V.; Writing—original draft, J.R.V.; Writing—review and editing, M.B. and M.S.M. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The data presented in the study [30,31] are openly available at https://doi.org/10.1594/PANGAEA.942856 (accessed on 30 March 2025) and https://doi.org/10.1594/PANGAEA.947038 (accessed on 30 March 2025), respectively. The code for MMTSCNET is available at https://github.com/jvahrenhold97/MMTSCNET (accessed on 30 March 2025).

Acknowledgments

The publication was supported by the publication fund of the Technical University of Applied Sciences Würzburg-Schweinfurt.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Decuyper, M. Combining Conventional Ground-Based and Remotely Sensed Forest Measurements. Ph.D. Thesis, Wageningen University, Wageningen, The Netherlands, 2018. [Google Scholar] [CrossRef]
  2. Emiliani, G.; Giovannelli, A. Tree Genetics: Molecular and Functional Characterization of Genes. Forests 2023, 14, 534. [Google Scholar] [CrossRef]
  3. Duncanson, L.; Liang, M.; Leitold, V.; Armston, J.; Krishna Moorthy, S.M.; Dubayah, R.; Costedoat, S.; Enquist, B.J.; Fatoyinbo, L.; Goetz, S.J.; et al. The effectiveness of global protected areas for climate change mitigation. Nat. Commun. 2023, 14, 2908. [Google Scholar] [CrossRef] [PubMed]
  4. Fettig, C.J.; Klepzig, K.D.; Billings, R.F.; Munson, A.S.; Nebeker, T.E.; Negrón, J.F.; Nowak, J.T. The effectiveness of vegetation management practices for prevention and control of bark beetle infestations in coniferous forests of the western and southern United States. For. Ecol. Manag. 2007, 238, 24–53. [Google Scholar] [CrossRef]
  5. Podgórski, T.; Schmidt, K.; Kowalczyk, R.; Gulczyńska, A. Microhabitat selection by Eurasian lynx and its implications for species conservation. Acta Theriol. 2008, 53, 97–110. [Google Scholar] [CrossRef]
  6. Kumar, P.; Debele, S.E.; Sahani, J.; Rawat, N.; Marti-Cardona, B.; Alfieri, S.M.; Basu, B.; Basu, A.S.; Bowyer, P.; Charizopoulos, N.; et al. An overview of monitoring methods for assessing the performance of nature-based solutions against natural hazards. Earth-Sci. Rev. 2021, 217, 103603. [Google Scholar] [CrossRef]
  7. McRoberts, R.; Tomppo, E. Remote sensing support for national forest inventories. Remote Sens. Environ. 2007, 110, 412–419. [Google Scholar] [CrossRef]
  8. Huete, A.R. Vegetation Indices, Remote Sensing and Forest Monitoring. Geogr. Compass 2012, 6, 513–532. [Google Scholar] [CrossRef]
  9. Fassnacht, F.E.; Latifi, H.; Stereńczak, K.; Modzelewska, A.; Lefsky, M.; Waser, L.T.; Straub, C.; Ghosh, A. Review of studies on tree species classification from remotely sensed data. Remote Sens. Environ. 2016, 186, 64–87. [Google Scholar] [CrossRef]
  10. Briechle, S.; Krzystek, P.; Vosselman, G. Classification of tree species and standing dead trees by fusing UAV-based LiDAR data and multispectral imagery in the deep neural network PointNet++. ISPRS Ann. Photogramm. Remote Sens. Spatial Inf. Sci. 2020, V-2-2020, 203–210. [Google Scholar] [CrossRef]
  11. Hell, M.; Brandmeier, M.; Briechle, S.; Krzystek, P. Classification of Tree Species and Standing Dead Trees with Lidar Point Clouds Using Two Deep Neural Networks: PointCNN and 3DmFV-Net. PFG–J. Photogramm. Remote Sens. Geoinf. Sci. 2022, 90, 103–121. [Google Scholar] [CrossRef]
  12. Qiao, Y.; Zheng, G.; Du, Z.; Ma, X.; Li, J.; Moskal, L.M. Tree-Species Classification and Individual-Tree-Biomass Model Construction Based on Hyperspectral and LiDAR Data. Remote Sens. 2023, 15, 1341. [Google Scholar] [CrossRef]
  13. Zhang, Z.; Wang, J.; Wu, Y.; Zhao, Y.; Wu, B. Deeply supervised network for airborne LiDAR tree classification incorporating dual attention mechanisms. GIScience Remote Sens. 2024, 61, 2303866. [Google Scholar] [CrossRef]
  14. Lin, Y.; Herold, M. Tree species classification based on explicit tree structure feature parameters derived from static terrestrial laser scanning data. Agric. For. Meteorol. 2016, 216, 105–114. [Google Scholar] [CrossRef]
  15. Fan, Z.; Zhang, W.; Zhang, R.; Wei, J.; Wang, Z.; Ruan, Y. Classification of Tree Species Based on Point Cloud Projection Images with Depth Information. Forests 2023, 14, 2014. [Google Scholar] [CrossRef]
  16. Lin, Y.; Hyyppä, J. A comprehensive but efficient framework of proposing and validating feature parameters from airborne LiDAR data for tree species classification. Int. J. Appl. Earth Observ. Geoinf. 2016, 46, 45–55. [Google Scholar] [CrossRef]
  17. Qi, C.R.; Su, H.; Kaichun, M.; Guibas, L.J. PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 77–85. [Google Scholar] [CrossRef]
  18. Qi, C.R.; Yi, L.; Su, H.; Guibas, L.J. PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space. In Proceedings of the 31st International Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; Volume 31, pp. 5105–5114. [Google Scholar] [CrossRef]
  19. Tan, M.; Le, Q.V. EfficientNetV2: Smaller Models and Faster Training. In Proceedings of the International Conference on Machine Learning (PMLR), Virtual, 18–24 July 2021; pp. 10096–10106. [Google Scholar] [CrossRef]
  20. Liu, B.; Chen, S.; Tian, X.; Huang, H.; Ren, M. Tree Species Classification of Point Clouds from Different Laser Sensors Using the PointNet++ Deep Learning Method. In Proceedings of the 2023 IEEE International Geoscience and Remote Sensing Symposium, Pasadena, CA, USA, 16–21 July 2023; pp. 1565–1568. [Google Scholar] [CrossRef]
  21. Allen, M.J.; Grieve, S.W.D.; Owen, H.J.F.; Lines, E.R. Tree species classification from complex laser scanning data in Mediterranean forests using deep learning. Methods Ecol. Evol. (MEE) 2023, 14, 1657–1667. [Google Scholar] [CrossRef]
  22. Fan, Z.; Wei, J.; Zhang, R.; Zhang, W. Tree Species Classification Based on PointNet++ and Airborne Laser Survey Point Cloud Data Enhancement. Forests 2023, 14, 1246. [Google Scholar] [CrossRef]
  23. Chen, J.; Chen, Y.; Liu, Z. Classification of Typical Tree Species in Laser Point Cloud Based on Deep Learning. Remote Sens. 2021, 13, 4750. [Google Scholar] [CrossRef]
  24. Zhang, C.; Xia, K.; Feng, H.; Yang, Y.; Du, X. Tree species classification using deep learning and RGB optical images obtained by an unmanned aerial vehicle. J. For. Res. 2021, 32, 1879–1888. [Google Scholar] [CrossRef]
  25. Egli, S.; Höpke, M. CNN-Based Tree Species Classification Using High Resolution RGB Image Data from Automated UAV Observations. Remote Sens. 2020, 12, 3892. [Google Scholar] [CrossRef]
  26. Schiefer, F.; Kattenborn, T.; Frick, A.; Frey, J.; Schall, P.; Koch, B.; Schmidtlein, S. Mapping forest tree species in high resolution UAV-based RGB-imagery by means of convolutional neural networks. ISPRS J. Photogramm. Remote Sens. 2020, 170, 205–215. [Google Scholar] [CrossRef]
  27. Reisi Gahrouei, O.; Côté, J.-F.; Bournival, P.; Giguère, P.; Béland, M. Comparison of Deep and Machine Learning Approaches for Quebec Tree Species Classification Using a Combination of Multispectral and LiDAR Data. Can. J. Remote Sens. 2024, 50, 2359433. [Google Scholar] [CrossRef]
  28. Liu, B.; Hao, Y.; Huang, H.; Chen, S.; Li, Z.; Chen, E.; Tian, X.; Ren, M. TSCMDL: Multimodal Deep Learning Framework for Classifying Tree Species Using Fusion of 2-D and 3-D Features. IEEE Trans. Geosci. Remote Sens. 2023, 61, 4402711. [Google Scholar] [CrossRef]
  29. Ferreira, M.P.; Dos Santos, D.R.; Ferrari, F.; Filho, L.C.T.C.; Martins, G.B.; Feitosa, R.Q. Improving urban tree species classification by deep-learning based fusion of digital aerial images and LiDAR. Urban For. Urban Green. 2024, 94, 128–240. [Google Scholar] [CrossRef]
  30. Weiser, H.; Schäfer, J.; Winiwarter, L.; Krašovec, N.; Seitz, C.; Schimka, M.; Anders, K.; Baete, D.; Braz, A.S.; Brand, J.; et al. Terrestrial, UAV-Borne, and Airborne Laser Scanning Point Clouds of Central European Forest Plots, Germany, with Extracted Individual Trees and Manual Forest Inventory Measurements. [Dataset]; PANGAEA: Bremen, Germany, 2022. [Google Scholar] [CrossRef]
  31. Weiser, H.; Schäfer, J.; Winiwarter, L.; Fassnacht, F.E.; Höfle, B. Airborne Laser Scanning (ALS) Point Clouds with Full-Waveform (FWF) Data of Central European Forest Plots, Germany. [Dataset]; PANGAEA: Bremen, Germany, 2022. [Google Scholar] [CrossRef]
  32. Felden, J.; Möller, L.; Schindler, U.; Huber, R.; Schumacher, S.; Koppe, R.; Diepenbroek, M.; Glöckner, F.O. PANGAEA—Data Publisher for Earth & Environmental Science. Sci. Data 2023, 10, 347. [Google Scholar] [CrossRef]
  33. Weiser, H.; Schäfer, J.; Winiwarter, L.; Krašovec, N.; Fassnacht, F.E.; Höfle, B. Individual tree point clouds and tree measurements from multi-platform laser scanning in German forests. Earth Syst. Sci. Data 2022, 14, 2989–3012. [Google Scholar] [CrossRef]
  34. Deutscher Wetterdienst (DWD). Wetter und Klima—Deutscher Wetterdienst—Presse—Deutschlandwetter im Jahr 2022. 2022. Available online: https://www.dwd.de/DE/presse/pressemitteilungen/DE/2022/20221230_deutschlandwetter_jahr2022_news.html (accessed on 19 January 2025).
  35. OpenStreetMap Contributors. 2024. Available online: https://download.geofabrik.de (accessed on 19 January 2025).
  36. Li, J.; Baoxin, H.; Noland, T.L. Classification of tree species based on structural features derived from high-density LiDAR data. Agric. For. Meteorol. 2013, 171/172, 104–114. [Google Scholar] [CrossRef]
  37. Michałowska, M.; Rapiński, J. A Review of Tree Species Classification Based on Airborne LiDAR Data and Applied Classifiers. Remote Sens. 2021, 13, 353. [Google Scholar] [CrossRef]
  38. Guo, Y.; Hongsheng, Z.; Qiaosi, L.; Yinyi, L.; Michalski, J. New morphological features for urban tree species identification using LiDAR point clouds. Urban For. Urban Green. 2022, 71, 127558. [Google Scholar] [CrossRef]
  39. Hovi, A.; Korhonen, L.; Vauhkonen, J.; Korpela, I. LiDAR waveform features for tree species classification and their sensitivity to tree- and acquisition-related parameters. Remote Sens. Environ. 2016, 173, 224–237. [Google Scholar] [CrossRef]
  40. Shi, Y.; Skidmore, A.K.; Wang, T.; Holzwarth, S.; Heiden, U.; Pinnel, N.; Zhu, X.; Heurich, M. Tree species classification using plant functional traits from LiDAR and hyperspectral data. Int. J. Appl. Earth Obs. Geoinf. 2018, 73, 207–219. [Google Scholar] [CrossRef]
  41. Shi, Y.; Wang, T.; Skidmore, A.K.; Heurich, M. Important LiDAR metrics for discriminating forest tree species in Central Europe. ISPRS J. Photogramm. Remote Sens. 2018, 137, 163–174. [Google Scholar] [CrossRef]
  42. Breiman, L. Random Forests. Mach. Learn. 2001, 45, 5–32. [Google Scholar] [CrossRef]
  43. Deng, J.; Dong, W.; Socher, R.; Li, L.-J.; Li, K.; Fei-Fei, L. ImageNet: A large-scale hierarchical image database. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 248–255. [Google Scholar] [CrossRef]
  44. Huang, G.; Liu, Z.; Van Der Maaten, L.; Weinberger, K.Q. Densely Connected Convolutional Networks. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2261–2269. [Google Scholar] [CrossRef]
  45. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar] [CrossRef]
  46. O’Malley, T.; Bursztein, E.; Long, J.; Chollet, F.; Jin, H.; Invernizzi, L. KerasTuner. GitHub Repository. Version 1.4.7. 2019. Available online: https://keras.io/keras_tuner/ (accessed on 20 January 2025).
  47. Lin, T.-Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal Loss for Dense Object Detection. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 42, 318–327. [Google Scholar] [CrossRef]
  48. Puliti, S.; Lines, E.; Müllerová, J.; Frey, J.; Schindler, Z.; Straker, A.; Allen, M.; Winiwarter, L.; Rehush, N.; Hristova, H.; et al. Benchmarking tree species classification from proximally-sensed laser scanning data: Introducing the FOR-species20K dataset. Methods Ecol. Evol. (MEE) 2025, 16, 801–818. [Google Scholar] [CrossRef]
Figure 1. Distribution of forest plots in the Karlsruhe (top right) and Bretten (bottom right) areas. The Airborne Laser Scanning (ALS) and UAV-borne Laser Scanning (ULS) data extent is hatched, while the full inventory data extent (ALS, ULS, Terrestrial Laser Scanning (TLS), and in situ measurements) is highlighted in orange. Map data: ©OpenStreetMap Contributors [35], distributed under the OpenData Commons Open Database License (ODbL) v1.0.
Figure 2. The entire preprocessing pipeline (with FWF data available) employed to generate the four types of data: tree point clouds, bidirectional, color-coded depth images of the tree point clouds, and numerical features derived from the LiDAR and FWF tree point clouds. The colors in this figure represent the introduction of new data types and the splitting of the data into training, validation, and test datasets. This entire process was repeated for all six subsets of data.
Figure 3. Frontal and top-down color-coded depth images of four individual tree point clouds from the ULS data subset under leaf-off conditions. Depth increases with decreasing brightness. The four species Picea abies, Carpinus betulus, Pseudotsuga menziesii, and Fagus sylvatica are shown in order to highlight morphological differences between species.
Figure 4. Simplified depiction of the proposed MMTSCNet architecture with its four distinct branches, Dynamic Modality Scaling module, and classification head. The Point Cloud Extractor is responsible for extracting features from the input point clouds, two instances of EfficientNetV2S extract features from the bidirectional, color-coded depth images, and the Metrics Model extracts features from a comprehensive list of numerical features. In this figure, ResidualConn denotes a residual connection, MSG denotes multi-scale grouping, Conv represents a convolution operation or block, LayerNorm corresponds to a layer normalization operation, while GroupNorm corresponds to a group normalization operation. HybridPool represents a combination of global maximum and average pooling, and Swish represents an activation layer with the Swish activation function.
Figure 5. Normalized confusion matrices produced by MMTSCNet on the test data for each individual subset of data.
Figure 6. Normalized confusion matrices (in %) generated during our ablation study for the various combinations of active branches/modules of MMTSCNet (in brackets). A red color-scheme indicates that the DMS was active, while a blue color-scheme indicates that the DMS was inactive. PCE denotes the Point Cloud Extractor branch, FIP the Frontal Image Processing branch, TDIP the Top-Down Image Processing branch, EMM the Enhanced Metrics Model branch, and DMS the Dynamic Modality Scaling module.
Figure 7. Attention weights generated during our ablation study for the various MMTSCNet instances with combinations of active modules and branches (in brackets). PCE denotes the Point Cloud Extractor branch, FIP the Frontal Image Processing branch, TDIP the Top-Down Image Processing branch, EMM the Enhanced Metrics Model branch, and DMS the Dynamic Modality Scaling module.
Figure 8. Carpinus betulus and Fagus sylvatica control images (colored by point height) taken from the ULS data under leaf-on conditions compared to three variants of Carpinus betulus and Fagus sylvatica from the ALS data. Differences in point density and general intra-species morphology are evident.
Table 1. Tree species characteristics in the source datasets sorted by acquisition (ALS and ULS) and the leaf condition (Leaf-On and Leaf-Off) if applicable. The seven predominant tree species that were used in our study are highlighted. All species are presented with their mean height, height standard deviation (STD), mean point density per m3, number of samples, and leaf morphology.
Species (Latin) | Mean Height | Height STD | Mean Point Density per m3 (ALS) | Mean Point Density per m3 (ULS) | Samples (ALS) | Samples (ULS Leaf-On) | Samples (ULS Leaf-Off) | Leaf Morph.
Abies alba | 23.70 | 6.89 | 2.76 | 35.02 | 20 | 7 | 12 | Coniferous
Acer campestre | 12.34 | 7.16 | 3.93 | 27.21 | 7 | 6 | 11 | Broad-Leaved
Acer pseudoplatanus | 19.17 | 7.77 | 2.87 | 19.41 | 39 | 36 | 39 | Broad-Leaved
Betula pendula | 20.16 | 6.14 | 4.35 | 29.95 | 6 | 4 | 4 | Broad-Leaved
Carpinus betulus | 15.68 | 5.44 | 2.27 | 15.36 | 90 | 89 | 132 | Broad-Leaved
Fagus sylvatica | 23.42 | 7.74 | 2.62 | 19.27 | 397 | 366 | 509 | Broad-Leaved
Fraxinus excelsior | 14.47 | 6.04 | 3.21 | 21.60 | 11 | 10 | 18 | Broad-Leaved
Juglans regia | 16.80 | 3.91 | 2.74 | 12.04 | 19 | 19 | 19 | Broad-Leaved
Larix decidua | 33.77 | 4.00 | 2.36 | 26.05 | 30 | 30 | 36 | Coniferous
Picea abies | 18.81 | 5.98 | 4.21 | 30.09 | 205 | 200 | 331 | Coniferous
Pinus sylvestris | 29.95 | 3.47 | 2.33 | 25.52 | 158 | 103 | 79 | Coniferous
Prunus avium | 16.14 | 3.71 | 3.28 | 18.61 | 19 | 19 | 37 | Broad-Leaved
Prunus serotina | 11.11 | 2.49 | 3.94 | 0.00 | 7 | 0 | 0 | Broad-Leaved
Pseudotsuga menziesii | 36.84 | 5.51 | 2.03 | 22.11 | 191 | 140 | 164 | Coniferous
Quercus petraea | 18.88 | 7.48 | 3.82 | 25.45 | 156 | 152 | 262 | Broad-Leaved
Quercus robur | 27.87 | 2.58 | 3.07 | 25.26 | 7 | 6 | 6 | Broad-Leaved
Quercus rubra | 22.47 | 4.03 | 3.09 | 44.23 | 11 | 19 | 29 | Broad-Leaved
Robinia pseudoacacia | 11.34 | 0.00 | 4.42 | 0.00 | 1 | 0 | 0 | Broad-Leaved
Salix caprea | 16.79 | 0.14 | 4.36 | 21.94 | 1 | 1 | 2 | Broad-Leaved
Sorbus torminalis | 13.55 | 0.21 | 0.00 | 5.32 | 0 | 1 | 1 | Broad-Leaved
Tilia (Not Specified) | 21.12 | 3.49 | 2.18 | 18.82 | 4 | 4 | 4 | Broad-Leaved
Tsuga heterophylla | 19.91 | 0.07 | 1.36 | 12.97 | 1 | 1 | 1 | Coniferous
Table 2. Total number of segmented individual tree point clouds in the source dataset by plot and data acquisition method according to Weiser et al. [33].
Plot | ALS (Leaf-On) | ULS (Leaf-On) | ULS (Leaf-Off)
BR01 | 514 | 503 | 503
BR02 | 42 | 42 | 41
BR03 | 195 | 141 | 141
BR04 | 9 | - | -
BR05 | 278 | 278 | 278
BR06 | 29 | 29 | 29
BR07 | 15 | 16 | 15
BR08 | 13 | 13 | 12
KA09 | 177 | 136 | 133
KA10 | 30 | 14 | -
KA11 | 151 | 97 | -
SP02 | 17 | 17 | 21
All plots | 1480 | 1286 | 1173
Table 3. Number of augmented tree point clouds by species and data acquisition method with the leaf condition in brackets.
Species | ALS + FWF | ULS (Leaf-On) + FWF | ULS (Leaf-Off) + FWF | ALS | ULS (Leaf-On) | ULS (Leaf-Off)
FagSyl | 790 | 860 | 1167 | 790 | 880 | 1245
CarBet | 667 | 735 | 630 | 667 | 840 | 1088
PicAbi | 720 | 1035 | 1657 | 720 | 1035 | 1926
PinSyl | 684 | 754 | x | 684 | 754 | x
PseMen | 1008 | 720 | 1328 | 1008 | 720 | 1365
QuePet | 494 | 603 | 969 | 494 | 612 | 1350
QueRub | 936 | 1088 | x | 936 | 1088 | x
Table 4. Radiometric features derived from the segmented FWF LiDAR point clouds (Unordered).
Name | Symbol | Derived From
Intensity Kurtosis | I_kurt | FWF
Mean Pulse Width | w̄_pulse | FWF
Intensity Mean | I_mean | FWF
Intensity Standard Deviation | I_std | FWF
Intensity Contrast | I_contrast | FWF
Echo Width | W | FWF
FWHM | FWHM | FWF
Table 5. Geometric features derived from the segmented ALS and ULS LiDAR point clouds (Unordered).
Name | Symbol | Derived From
Point Density | ρ_points | ALS/ULS
Leaf Area Index | LAI | ALS/ULS
Crown Shape Indices | I_j(crown) | ALS/ULS
Point Density for Normalized Height Bin j | ρ_j | ALS/ULS
Relative Clustering Degree | R_NN | ALS/ULS
Average Nearest Neighbor Distance | d̄_NN | ALS/ULS
Canopy Closure | C_closure | ALS/ULS
Entropy of Height Distribution | H_entropy | ALS/ULS
Crown Volume | V_crown | ALS/ULS
Canopy Surface-to-Volume Ratio | SVR | ALS/ULS
Equivalent Crown Diameter | D_eq | ALS/ULS
Fractal Dimension (k = 2) | D_f | ALS/ULS
Main Component (PCA) Eigenvalues | λ_1/2, λ_2/3 | ALS/ULS
Linearity | L_points | ALS/ULS
Sphericity | S_points | ALS/ULS
Planarity | P_points | ALS/ULS
Maximum Crown Diameter | D_max | ALS/ULS
Height Kurtosis | K_H | ALS/ULS
Height Skewness | S_H | ALS/ULS
Height Standard Deviation | H_std | ALS/ULS
Leaf Inclination | θ_leaf | ALS/ULS
Convex Hull Compactness | C_hull | ALS/ULS
Crown Asymmetry | A_crown | ALS/ULS
Leaf Curvature | κ_leaf | ALS/ULS
N-th Percentile of Height Distribution | P(Z, n) | ALS/ULS
Canopy Cover Fraction | C_f | ALS/ULS
Canopy Ellipticity | E_canopy | ALS/ULS
Gini Coefficient for Height Distribution | G(H) | ALS/ULS
Branch Density | ρ_branch | ALS/ULS
Height Variation Coefficient | HVC | ALS/ULS
Crown Symmetry | S_crown | ALS/ULS
Crown Curvature | C_crown | ALS/ULS
Canopy Width x and y | W_x, W_y | ALS/ULS
Density Gradient | G_density | ALS/ULS
Surface Roughness | SR | ALS/ULS
Segment Density for Height Bin i | SegDens_i | ALS/ULS
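As an illustration of how features of this kind can be derived, the following sketch computes a small subset of the geometric descriptors listed above from an (N, 3) single-tree point cloud. The formulas are simplified assumptions for demonstration and do not necessarily match the exact definitions used in this study.

```python
import numpy as np
from scipy.stats import skew, kurtosis
from scipy.spatial import ConvexHull


def basic_geometric_features(points):
    """Compute an illustrative subset of the geometric features in Table 5."""
    z = points[:, 2]
    hull = ConvexHull(points)                            # 3D convex hull of the tree
    xy_extent = points[:, :2].max(axis=0) - points[:, :2].min(axis=0)

    # PCA eigenvalues for linearity / planarity / sphericity
    cov = np.cov(points.T)
    l1, l2, l3 = np.sort(np.linalg.eigvalsh(cov))[::-1]

    return {
        "H_std": z.std(),                                # height standard deviation
        "S_H": skew(z),                                  # height skewness
        "K_H": kurtosis(z),                              # height kurtosis
        "P(Z, 95)": np.percentile(z, 95),                # 95th height percentile
        "rho_points": len(points) / hull.volume,         # points per m^3 of hull volume
        "W_x": xy_extent[0],                             # canopy width in x
        "W_y": xy_extent[1],                             # canopy width in y
        "L_points": (l1 - l2) / l1,                      # linearity
        "P_points": (l2 - l3) / l1,                      # planarity
        "S_points": l3 / l1,                             # sphericity
    }


# Example with a synthetic point cloud
pts = np.random.rand(2000, 3) * [4.0, 4.0, 18.0]
print(basic_geometric_features(pts))
```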
Table 6. Architectural hyperparameters selected based on a majority vote across all available subsets of data with the BayesianOptimization tuner.
Hyperparameter | Selected Value
PCE Depth | 3
PCE Convolution Filters | 256
PCE Number of NN | 24
PCE MSG Radii | 0.055, 0.135, 0.345, 0.525, 0.695
EMM Dense Units | 512
Classification Head Projection Units | 128
Classification Head Depth | 4
Classification Dense Units | 512
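As an illustration of how such hyperparameters can be selected with the BayesianOptimization tuner from KerasTuner [46], the following minimal sketch tunes two classification-head hyperparameters on a toy model. The search space, objective, and model structure are assumptions for demonstration and do not reproduce the full MMTSCNet search.

```python
import keras_tuner as kt
import tensorflow as tf
from tensorflow.keras import layers


def build_model(hp):
    # Toy model: only two of the hyperparameters from Table 6 are searched here.
    inputs = tf.keras.Input(shape=(45,))
    x = inputs
    for _ in range(hp.Int("classification_head_depth", 2, 5)):
        x = layers.Dense(hp.Choice("classification_dense_units", [128, 256, 512]),
                         activation="swish")(x)
    outputs = layers.Dense(7, activation="softmax")(x)
    model = tf.keras.Model(inputs, outputs)
    model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
    return model


tuner = kt.BayesianOptimization(
    build_model,
    objective="val_accuracy",
    max_trials=20,
    overwrite=True,
    directory="tuning",
    project_name="mmtscnet_demo",
)
# With prepared training and validation arrays:
# tuner.search(x_train, y_train, validation_data=(x_val, y_val), epochs=10)
# best_hps = tuner.get_best_hyperparameters(1)[0]
```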
Table 7. Comprehensive overview of the metrics suggested by Zhang et al. [13], which are calculated from the confusion matrices of each data subset after the training of MMTSCNet. TP_i represents all true positive samples of the tree species i, FP_i represents all false positive samples of the tree species i, TN_i represents all true negative samples of the tree species i and FN_i represents all false negative samples of the tree species i. TS_i is the total sample size of species i and PS_i is the predicted sample size for species i.
Metric Name | Formula
MAP (Macro Average Precision) | $\frac{1}{n}\sum_{i=1}^{n}\frac{TP_i}{TP_i + FP_i}$
MAR (Macro Average Recall) | $\frac{1}{n}\sum_{i=1}^{n}\frac{TP_i}{TP_i + FN_i}$
MAF (Macro Average F1-Score) | $\frac{1}{n}\sum_{i=1}^{n}\frac{2\,TP_i}{2\,TP_i + FP_i + FN_i}$
OA (Overall Accuracy) | $\frac{\sum_{i=1}^{n}(TP_i + TN_i)}{\sum_{i=1}^{n}(TP_i + TN_i + FP_i + FN_i)}$
Cohen's Kappa Score | $\frac{OA \cdot \left(\sum_{i=1}^{n} TS_i\right)^2 - \sum_{i=1}^{n} TS_i \cdot PS_i}{\left(\sum_{i=1}^{n} TS_i\right)^2 - \sum_{i=1}^{n} TS_i \cdot PS_i}$
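For reference, the following minimal sketch computes these metrics from a confusion matrix with NumPy. OA is computed in its standard multi-class form (correctly classified samples over all samples), and Cohen's Kappa follows the formulation above with the total sample size given by the sum of all TS_i; the example confusion matrix is arbitrary.

```python
import numpy as np


def classification_metrics(cm):
    """Compute OA, MAP, MAR, MAF, and Cohen's Kappa from a confusion matrix.

    cm[i, j] = number of samples of true class i predicted as class j,
    following the per-class TP/FP/FN definitions listed in Table 7.
    """
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)
    fp = cm.sum(axis=0) - tp          # predicted as class i but belonging to another class
    fn = cm.sum(axis=1) - tp          # belonging to class i but predicted otherwise
    ts = cm.sum(axis=1)               # true sample size per class (TS_i)
    ps = cm.sum(axis=0)               # predicted sample size per class (PS_i)
    n = cm.sum()

    precision = tp / np.maximum(tp + fp, 1e-9)
    recall = tp / np.maximum(tp + fn, 1e-9)
    f1 = 2 * tp / np.maximum(2 * tp + fp + fn, 1e-9)

    oa = tp.sum() / n
    kappa = (oa * n**2 - np.sum(ts * ps)) / (n**2 - np.sum(ts * ps))
    return {"OA": oa, "MAP": precision.mean(), "MAR": recall.mean(),
            "MAF": f1.mean(), "Kappa": kappa}


# Example with a small, arbitrary three-class confusion matrix
print(classification_metrics([[50, 2, 1], [3, 45, 2], [0, 4, 43]]))
```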
Table 8. Overall Accuracy (OA), Macro Average F1-Score (MAF), Macro Average Precision (MAP), Macro Average Recall (MAR), and Kappa Coefficient derived from the confusion matrices of the test data for the six data subsets in Figure 5. The subset for which MMTSCNet achieved the highest classification performance is highlighted.
Dataset | OA | MAF | MAP | MAR | Kappa Coefficient | Species
ALS | 0.928 | 0.923 | 0.929 | 0.928 | 0.915 | 7
ALS + FWF | 0.966 | 0.966 | 0.967 | 0.966 | 0.960 | 7
ULS Leaf-On | 0.915 | 0.915 | 0.917 | 0.915 | 0.900 | 7
ULS Leaf-On + FWF | 0.957 | 0.958 | 0.957 | 0.957 | 0.949 | 7
ULS Leaf-Off | 0.927 | 0.928 | 0.929 | 0.927 | 0.908 | 5
ULS Leaf-Off + FWF | 0.954 | 0.952 | 0.956 | 0.955 | 0.941 | 5
Table 9. Metrics for each available tree species across all experiments conducted with MMTSCNet, which were derived from the confusion matrices in Figure 5. The species with the highest F1-Score, Recall, and Precision for each data subset are highlighted.
Data Subset | Metric | CarBet | FagSyl | PicAbi | PinSyl | PseMen | QuePet | QueRub
ALS | F1-Score | 0.94 | 0.91 | 0.96 | 0.93 | 0.96 | 0.78 | 0.95
ALS | Precision | 0.89 | 0.91 | 0.95 | 0.93 | 0.97 | 0.78 | 0.99
ALS | Recall | 1.00 | 0.91 | 0.98 | 0.94 | 0.95 | 0.77 | 0.91
ALS + FWF | F1-Score | 0.99 | 0.94 | 0.97 | 0.97 | 0.97 | 0.93 | 0.99
ALS + FWF | Precision | 0.98 | 0.93 | 0.95 | 1.00 | 0.97 | 0.98 | 0.98
ALS + FWF | Recall | 1.00 | 0.95 | 0.99 | 0.95 | 0.97 | 0.89 | 1.00
ULS Leaf-Off | F1-Score | 0.88 | 0.87 | 0.95 | x | 0.98 | 0.93 | x
ULS Leaf-Off | Precision | 0.85 | 0.87 | 0.96 | x | 0.98 | 0.94 | x
ULS Leaf-Off | Recall | 0.92 | 0.88 | 0.94 | x | 0.97 | 0.92 | x
ULS Leaf-Off + FWF | F1-Score | 0.86 | 0.96 | 0.95 | x | 0.98 | 1.00 | x
ULS Leaf-Off + FWF | Precision | 0.82 | 0.95 | 0.98 | x | 0.96 | 1.00 | x
ULS Leaf-Off + FWF | Recall | 0.90 | 0.92 | 0.93 | x | 1.00 | 1.00 | x
ULS Leaf-On | F1-Score | 0.86 | 0.88 | 0.95 | 0.91 | 0.92 | 0.90 | 0.96
ULS Leaf-On | Precision | 0.84 | 0.88 | 0.93 | 0.90 | 0.94 | 0.98 | 0.95
ULS Leaf-On | Recall | 0.88 | 0.89 | 0.97 | 0.91 | 0.90 | 0.83 | 0.97
ULS Leaf-On + FWF | F1-Score | 0.92 | 0.90 | 0.95 | 0.99 | 0.95 | 0.98 | 1.00
ULS Leaf-On + FWF | Precision | 0.91 | 0.92 | 0.93 | 0.99 | 1.00 | 0.96 | 1.00
ULS Leaf-On + FWF | Recall | 0.94 | 0.88 | 0.98 | 1.00 | 0.90 | 1.00 | 1.00
Table 10. Comparison of the results of the tree species classification on the dataset by Weiser et al. [31]. The results for the PointNet++ variants with Farthest Point Sampling (FPS), Random Sampling (RS), Grid Average Sampling (GAS), Non-Uniform Grid Sampling (NGS), K-means Sampling (KS), as well as for the DSTCN were published by Zhang et al. [13]. The results of PointNet++ with Non-Uniform Grid and Farthest Point Sampling (NGFPS) were published by Liu et al. [20]. The highest values for each data subset are highlighted.
Data Subset | Model | OA | MAF | MAP | MAR | Kappa Coefficient | Species
ALS | PointNet++ (FPS) | 0.83 | 0.82 | 0.83 | 0.82 | 0.80 | 7
ALS | PointNet++ (RS) | 0.83 | 0.83 | 0.83 | 0.83 | 0.80 | 7
ALS | PointNet++ (GAS) | 0.85 | 0.84 | 0.85 | 0.84 | 0.82 | 7
ALS | PointNet++ (NGS) | 0.85 | 0.85 | 0.86 | 0.85 | 0.83 | 7
ALS | PointNet++ (KS) | 0.86 | 0.86 | 0.87 | 0.86 | 0.84 | 7
ALS | DSTCN | 0.94 | 0.94 | 0.95 | 0.95 | 0.93 | 7
ALS | MMTSCNet | 0.93 | 0.92 | 0.93 | 0.93 | 0.91 | 7
ALS + FWF | MMTSCNet | 0.97 | 0.97 | 0.97 | 0.97 | 0.96 | 7
ULS Leaf-On | PointNet++ (NGFPS) | 0.90 | x | x | x | 0.86 | 4
ULS Leaf-On | MMTSCNet | 0.92 | 0.92 | 0.92 | 0.92 | 0.90 | 7
ULS Leaf-On + FWF | MMTSCNet | 0.96 | 0.96 | 0.96 | 0.96 | 0.95 | 7
ULS Leaf-Off | PointNet++ (NGFPS) | 0.89 | x | x | x | 0.84 | 4
ULS Leaf-Off | MMTSCNet | 0.93 | 0.93 | 0.93 | 0.93 | 0.90 | 5
ULS Leaf-Off + FWF | MMTSCNet | 0.95 | 0.95 | 0.96 | 0.96 | 0.94 | 5
Table 11. Classification performance metrics calculated for the various instances of MMTSCNet with branches or the DMS module disabled. PCE denotes the Point Cloud Extractor branch, FIP the Frontal Image Processing branch, TDIP the Top-Down Image Processing branch, EMM the Enhanced Metrics Model branch, and DMS the Dynamic Modality Scaling module.
Active Modules | Disabled Modules | OA | MAF | MAP | MAR | Kappa Coefficient
DMS, PCE, FIP, TDIP, EMM | x | 0.97 | 0.97 | 0.97 | 0.97 | 0.96
DMS, PCE, FIP, TDIP | EMM | 0.84 | 0.79 | 0.81 | 0.81 | 0.75
DMS, PCE, FIP, EMM | TDIP | 0.88 | 0.88 | 0.89 | 0.88 | 0.73
DMS, PCE, TDIP, EMM | FIP | 0.90 | 0.89 | 0.90 | 0.89 | 0.80
DMS, PCE, EMM | FIP, TDIP | 0.90 | 0.89 | 0.90 | 0.89 | 0.74
DMS, PCE, FIP | TDIP, EMM | 0.86 | 0.83 | 0.83 | 0.84 | 0.90
DMS, PCE, TDIP | FIP, EMM | 0.63 | 0.58 | 0.62 | 0.62 | 0.22
DMS, PCE | FIP, TDIP, EMM | 0.48 | 0.45 | 0.58 | 0.48 | 0.08
PCE, FIP, TDIP, EMM | DMS | 0.82 | 0.83 | 0.86 | 0.83 | 0.61