1. Introduction
Forests are an important natural resource on Earth, and their management is crucial for both human well-being and ecological health [1]. Forest inventory and the extraction of forest structural parameters are the primary tasks in forest management. In recent years, remote-sensing technology has developed rapidly, and the use of unmanned aerial vehicles (UAVs) to collect remote-sensing images has gradually become the main way to obtain forest remote-sensing data. UAV-based remote-sensing image acquisition platforms are low-cost and easily portable [2], and researchers can obtain data from forest land in the study area on a large scale in a short period of time [3]. However, traditional remote-sensing image data capture only horizontal structural information and cannot provide information on the vertical structure [4], which is crucial for extracting single-tree parameters. Currently, three-dimensional point cloud data are widely used in forest inventory. Compared with 2D remote-sensing image data, point cloud data provide richer three-dimensional information, which can effectively capture the morphological characteristics of trees and improve the accuracy of forest inventory. Therefore, the collection of high-spatial-resolution aerial three-dimensional remote-sensing data based on UAV platforms has gradually become an effective technology for conducting forest surveys, thanks to its wide coverage and regular data collection cycles [5].
In forest inventory, if high-precision single-tree point cloud information can be separated from large volumes of point cloud data, it will be of great significance for the subsequent extraction of single-tree structural parameters and will provide strong support for forest biomass inversion and the construction of three-dimensional forest models [6]. Existing point cloud-based single-tree extraction methods generally fall into two categories. The first is raster-based single-tree extraction, which converts raw point cloud data into rasterized images and usually requires generating a canopy height model (CHM) or digital surface model (DSM) to find local maxima, which are treated as treetops. Raster-based point cloud segmentation methods are generally single-tree extraction algorithms improved for specific application scenarios; commonly used examples include the watershed algorithm [7], the region-growing algorithm [8], their improved variants [9,10,11], etc. Although raster-based methods are relatively mature and fast, they cannot make full use of the three-dimensional characteristics of point clouds and are easily affected by nontree objects during single-tree segmentation, leading to a decrease in detection accuracy [12].
The other type of method uses three-dimensional algorithms to extract single-tree point clouds directly, making better use of the three-dimensional characteristics of point clouds. Wang et al. (2008) [13] used a voxel-spacing algorithm to partition the entire study area into small research units with a grid network, realizing structural analysis of forest point clouds in the vertical direction of the canopy and segmentation of three-dimensional single-tree point clouds. Reitberger et al. (2009) [14] combined trunk detection with a normalized-cut segmentation algorithm to achieve three-dimensional segmentation of single trees; compared with a standard watershed segmentation procedure, the result was up to 12% higher in the best case. Gupta et al. (2010) [15] improved the K-means [16] algorithm by using external seed points and reduced height to initialize the clustering process, achieving the extraction of single-tree point clouds. Hu et al. (2017) [17] proposed an adaptive-kernel-bandwidth Meanshift [18] algorithm, which realized single-tree point cloud segmentation of evergreen broadleaf forests. Ayrey et al. (2017) [19] proposed a layer-stacking algorithm that slices the forest point cloud into layers at fixed height intervals, obtains the single-tree contours in each layer, and synthesizes them to obtain complete single-tree point clouds. Tang et al. (2022) [20] proposed an optimized Meanshift algorithm that integrates both the red-green-blue (RGB) and luminance-bandwidth-chrominance (YUV) color spaces for identifying oil tea fruit point clouds.
In summary, research on single-tree point cloud segmentation is constantly advancing, and its accuracy is continuously improving. However, most of these methods are clustering-based and do not utilize the color information of point clouds; they are therefore applicable only to pure forest scenes and have a narrow application range. Point cloud clustering usually uses only characteristic attributes such as density, distance, elevation, and intensity to segment point clouds with different attributes. The main point cloud clustering algorithms currently include OPTICS [21], spectral clustering [22], Meanshift, DBSCAN [23], and K-means. With the development of UAV data acquisition platforms and the widespread use of high-performance, large-capacity computers, RGB point cloud data are becoming increasingly common, and the quality of detail expression in complex scenes is also increasing. At present, these high-quality point clouds exist only as visualization products, because clustering methods cannot automatically learn features from the data as deep learning does and thus fail to represent the complex relationships within the data effectively. Therefore, there is an urgent need to explore deep learning methods for better performance when dealing with data such as RGB point clouds.
To address this issue, this paper combines the point cloud clustering algorithm Meanshift with the deep semantic segmentation network Improved-RandLA-Net to achieve single-tree extraction from RGB point clouds in complex scenes. This paper studies three experimental areas, comprising Ginkgo data and Pecan data from Lin’an District, Hangzhou City, Zhejiang Province, China, as well as Fraxinus excelsior data from the Orlando Convention Center in the STPLS3D [24] dataset. All data were obtained by photogrammetry. Ultimately, our method achieved good results in different types of scenes. The innovations of this paper are: (1) a single-tree point cloud extraction method based on deep semantic segmentation followed by clustering; and (2) the improvement of the point cloud semantic segmentation network RandLA-Net into Improved-RandLA-Net, which improves the accuracy of point cloud semantic segmentation.
2. Materials
2.1. Study Area Overview
The experimental areas are located in Lin’an District, Hangzhou City, Zhejiang Province (118°51′~119°52′E, 29°56′~30°23′N). Lin’an lies in the western part of Hangzhou City and has a central subtropical monsoon climate. The warm and humid conditions, abundant sunshine, plentiful rainfall, and distinct seasons are very favorable for plant growth. The area hosts a diverse range of plant species, including Ginkgo, Cinnamomum camphora, Lin’an Pecan, Liriodendron chinense, Zelkova serrata, Torreya jackii, and others. Based on the distribution of major tree species and differences in forest density, two experimental areas focusing on Ginkgo and Lin’an Pecan were selected within the region. Data were collected from the Donghu Campus of Zhejiang A&F University and the Xiguping Forest Farm, as shown in Figure 1. In addition, a portion of the STPLS3D public dataset was selected as an experimental area to evaluate the effectiveness of the proposed method.
In this paper, the experimental areas were divided into three: the Ginkgo dataset (Area 1), the Lin’an Pecan dataset (Area 2), and the Fraxinus excelsior data from the STPLS3D dataset (Area 3).
Area 1 was collected on 12 July 2022 on the Donghu Campus of Zhejiang A&F University under clear, well-lit conditions. Area 2 was collected on 9 October 2021 within the Lin’an Xiguping Forest Farm. All flights were conducted at a height of 50 m with lateral and longitudinal overlaps of 85% under clear, well-lit conditions.
2.2. UAV Data Acquisition System
In this paper, we utilized the DJI Phantom 4 RTK drone as the data acquisition platform. This device provides real-time centimeter-level positioning data and professional route-planning applications. Additionally, equipped with a 20-megapixel RGB high-definition camera, it can provide high-precision data for complex measurement, mapping, and inspection tasks. Further details regarding the drone parameters can be found in Table 1.
2.3. Dataset Production
The point cloud data in this paper were generated using the UAV mapping software Pix4Dmapper 4.8.4. Visual interpretation and manual labeling of tree point clouds within the study area were conducted using the point cloud processing software CloudCompare 2.13 together with remote-sensing images. During dataset production, only point clouds of the studied tree species in each experimental area were labeled, while all other point clouds were labeled as background. In total, 157 Ginkgo trees were obtained in Area 1 (108 training, 49 testing); 166 Lin’an Pecan trees were obtained in Area 2 (110 training, 56 testing); and 165 Fraxinus excelsior trees were obtained in Area 3 (100 training, 65 testing).
All datasets in this study were produced with reference to the Stanford3dDataset_v1.2_Aligned_Version format standard of the public point cloud semantic segmentation dataset S3DIS; the point cloud files were txt files containing color and coordinate information (RGBXYZ format).
3. Methods
3.1. Overview
Figure 2 summarizes the entire experimental process, which includes the following steps:
(1) Collected RGB image data and synthesized them into point cloud data based on the SfM method;
(2) Produced and partitioned the dataset;
(3) Input the point cloud data of the training sample into the Improved-RandLA-Net deep point cloud semantic segmentation network to obtain the trained network model;
(4) Used the trained model to perform semantic segmentation on the test sample, obtaining point cloud segmentation results with classification labels;
(5) Extracted the target tree point cloud based on the semantic segmentation label category;
(6) Used the distance-based point cloud denoising method to denoise the extracted tree point cloud;
(7) Used the point cloud clustering method Meanshift to extract single-tree point clouds.
3.2. Point Cloud Synthesis
Currently, there are two main sources of RGB point cloud data: (1) using algorithms such as SfM to reconstruct 3D point clouds from high-resolution RGB photos; and (2) collecting data simultaneously with a laser scanner and a panoramic camera and matching the point clouds and images through complex algorithms to obtain RGB point clouds. The point cloud data used in this paper were obtained through method (1). In this study, we utilized the DJI Phantom 4 RTK UAV for data acquisition, which offers high-precision positioning and measurement capabilities, providing accurate location information. Following data collection, we employed Pix4D software for point cloud generation and 3D reconstruction. Pix4D combines the acquired images with photogrammetric equations, using a general geometric correction model to transform image coordinates into three-dimensional object-space coordinates. Additionally, Pix4D automatically calibrates the camera sensor to correct internal parameters and distortions using geometric correction models such as the Brown model and the free-form deformation model, thereby enhancing the accuracy and precision of point cloud generation. By combining the DJI Phantom 4 RTK UAV with Pix4D software, we acquired point cloud data of the experimental areas and conducted tree segmentation tasks.
SfM (Structure from Motion) [25] is an offline algorithm that reconstructs a 3D point cloud from a series of highly overlapping images, obtained from different viewpoints in both horizontal and vertical directions [26]. The basic process involves detecting feature points in each image and matching feature points between pairs of images while preserving geometric constraints. Finally, the SfM method is executed iteratively to reconstruct the point cloud, recovering the original camera poses and scene geometry and yielding the RGB point cloud.
3.3. Deep Semantic Segmentation of Target Tree Point Clouds
The essence of semantic segmentation is classification, which manifests as classifying each point in point cloud data. Semantic segmentation of point clouds will eventually assign a single class to each point, and the classification will be based not only on the color value of each point but also on the location and interrelationship of the points.
With the continuous development of artificial intelligence technology, numerous deep learning-based point cloud semantic segmentation algorithms have emerged, such as PointNet [27,28,29], PointNet++ [30,31,32], and RandLA-Net [33,34,35]. Among them, the PointNet series is suitable only for small-scale indoor scenes, while RandLA-Net is a semantic segmentation model suitable for large-scale outdoor scenes.
In this paper, a novel deep point cloud semantic segmentation network based on RandLA-Net was constructed using point coordinates (XYZ) and color attributes (RGB) as feature inputs. RandLA-Net is an efficient, lightweight model that can process large-scale point clouds more quickly and accurately than other networks. However, RandLA-Net has certain limitations in feature extraction and may lose some local features of the point cloud [36]. Although it combines the coordinates and colors of the raw data to enhance feature extraction in the LFA process, the overall improvement is not significant, mainly because it cannot adaptively enhance the important features that play a crucial role in the model’s predictions. As trees, the target of single-tree segmentation, are affected by factors such as tree shape and terrain, the model needs stronger local feature learning capabilities. Therefore, this paper improves the LFA (local feature aggregation) module of RandLA-Net and proposes the Improved-RandLA-Net model. The improved model employs random sampling (RS) and IMP-LFA (Improved Local Feature Aggregation) modules to enhance and extract the channel and spatial features of the input point cloud through attention mechanisms [37,38], significantly improving the model’s ability to extract local point cloud features and the accuracy of the final semantic segmentation. The specific structure of the model is shown in Figure 3.
3.3.1. Deep Point Cloud Semantic Segmentation Network
The structure of the backbone inherited from RandLA-Net is illustrated in Figure 3, comprising four encoding–decoding processes, each corresponding to a dashed connection. The input data underwent four encoding layers, each supported by an RS and an IMP-LFA module, reducing the number of input points to one-quarter while increasing the feature vector’s dimension to four times its original size. Subsequently, the data passed through four decoding layers that employed the K-Nearest Neighbors (KNN) algorithm. Semantic segmentation was then performed, and the output comprised the semantic labels of the point cloud.
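As a rough illustration of this downsampling schedule (not the authors' code; the initial feature dimension and the tiling used to stand in for the learned feature expansion are placeholders), the following sketch traces how four RS stages reduce 40,960 input points to one-quarter at each layer while the feature dimension quadruples:

```python
import numpy as np

rng = np.random.default_rng(0)

def random_sample(points, feats, ratio=4):
    """Random sampling (RS): keep 1/ratio of the points."""
    idx = rng.choice(points.shape[0], size=points.shape[0] // ratio, replace=False)
    return points[idx], feats[idx]

# 40,960 input points with a placeholder 8-dimensional feature vector
points, feats = rng.normal(size=(40960, 3)), rng.normal(size=(40960, 8))
for layer in range(4):
    points, feats = random_sample(points, feats)
    feats = np.tile(feats, (1, 4))  # stand-in for the 4x feature-dimension expansion
print(points.shape, feats.shape)  # (160, 3) (160, 2048)
```

After four stages, only 160 points remain, each carrying a much richer feature vector, which is what makes random sampling affordable on large-scale point clouds.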
3.3.2. LFA Improvement for Complex Forest Features (IMP-LFA)
The LFA module is the most important mechanism in RandLA-Net; its primary function is to extract and aggregate the local features of each point. However, the original LFA module may lose some features during local feature extraction, resulting in incomplete feature learning and reduced semantic segmentation accuracy. To address this issue, this paper proposes the Attentive Chain, which improves the original LFA module into IMP-LFA. As shown in Figure 3, the IMP-LFA module consists mainly of Local Spatial Encoding (LocSE), Attentive Pooling, and the Attentive Chain. In addition to the original channels, the input point cloud passes through SE and CBAM modules on the Attentive Chain, extracting local spatial and channel features and fusing them with the features from the other channels. SE (Squeeze-and-Excitation) is an attention mechanism for enhancing and extracting the channel features of the input point cloud. CBAM (Convolutional Block Attention Module) is an attention mechanism for enhancing and extracting both the channel and spatial features of the input point cloud. In this paper, the channel features are represented by RGB, which indicates the color of each point, and the spatial dimension is represented by the XYZ coordinates, which indicate the position of each point in 3D space. The SE and CBAM modules function in the IMP-LFA as follows. First, the point cloud data went through the SE module: through compression, the channel features were converted into a feature vector, and a convolutional layer generated channel attention weights. These weights represent the importance of each channel across the entire point cloud. The channel attention weights were then multiplied with the original channel features, resulting in a weighted channel feature map. This weighting enabled the model to pay more attention to significant channel features, thereby enhancing its feature extraction capabilities. Building upon the SE module, the CBAM module introduced a spatial attention mechanism: following the channel attention of the SE module, it learned the correlations between different positions in the feature map through global average pooling and global maximum pooling operations, after which a set of fully connected layers generated a spatial attention weight map representing the importance of different positions.
Next, the channel attention weights and spatial attention weights were applied to the feature map separately, producing the final weighted feature map. This weighting enabled the model to better focus on essential channel features and spatial positions, further enhancing feature extraction. Finally, the point cloud features extracted by SE and CBAM were concatenated with the features extracted by the original network to obtain the aggregated features of the final point cloud. This improved the model’s feature learning ability and semantic segmentation accuracy, while also effectively increasing learning efficiency by extracting richer local features from a single training sample [39]. This reduced the number of samples required for model training, enabling training and detection on smaller sample datasets.
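To make the squeeze-excite-reweight sequence concrete, here is a minimal NumPy sketch of SE-style channel attention over per-point features. This is an illustrative simplification, not the paper's implementation: the bottleneck weight matrices `w1`/`w2` and the reduction ratio are hypothetical stand-ins for learned layers.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def se_reweight(feats, w1, w2):
    """SE-style channel attention over per-point features.
    feats: (N, C) features; w1: (C, C//r) and w2: (C//r, C) bottleneck weights."""
    squeeze = feats.mean(axis=0)              # squeeze: global average over all points -> (C,)
    excite = np.maximum(squeeze @ w1, 0.0)    # excitation bottleneck with ReLU
    weights = sigmoid(excite @ w2)            # per-channel attention weights in (0, 1)
    return feats * weights                    # reweight each channel by its importance

rng = np.random.default_rng(0)
feats = rng.normal(size=(1024, 16))           # 1024 points, 16 feature channels
w1 = rng.normal(size=(16, 4))                 # reduction ratio r = 4
w2 = rng.normal(size=(4, 16))
out = se_reweight(feats, w1, w2)
```

Because the attention weights lie in (0, 1), each channel is attenuated in proportion to its estimated importance; CBAM would add an analogous reweighting over spatial positions.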
3.4. Clustering-Based Extraction of Single-Tree Point Clouds
3.4.1. Point Cloud Denoising
Because the semantic segmentation model cannot completely and accurately segment the target tree point cloud, the final segmentation result may contain a small number of erroneously segmented points, i.e., noise points. To obtain more accurate single-tree extraction results, this paper employed a point cloud denoising algorithm based on distance statistics [40] to remove noise points. The basic principle of this algorithm is to compute the average distance from each point to its neighboring points; points whose distance values exceed a certain threshold are considered noise points and removed.
3.4.2. Clustering Extraction of Point Clouds
Semantic segmentation resulted in a block of tree point clouds, rather than single-tree point clouds. As a solution, point cloud clustering was utilized for tree point cloud instance segmentation in this study.
In this paper, we used the Meanshift algorithm to extract single-tree point clouds. Meanshift is a density-based clustering algorithm that assumes datasets of different clusters conform to different probability density distributions, with regions of high sample density corresponding to cluster centers [41]. Therefore, when segmenting two slightly connected tree point clouds, the algorithm determines the cluster centers based on point cloud density and performs clustering. By exploiting the dense and sparse regions within the tree point cloud, single-tree point clouds can be segmented accurately.
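A minimal example of this clustering step using scikit-learn's MeanShift on toy data (the synthetic "crowns" and the quantile value here are illustrative; the study's actual bandwidth settings are given in Section 3.6):

```python
import numpy as np
from sklearn.cluster import MeanShift, estimate_bandwidth

rng = np.random.default_rng(1)
# Two synthetic "crowns": dense blobs that are close but separable by density
crown_a = rng.normal(loc=[0.0, 0.0, 5.0], scale=0.4, size=(300, 3))
crown_b = rng.normal(loc=[3.0, 0.0, 5.0], scale=0.4, size=(300, 3))
points = np.vstack([crown_a, crown_b])

# Bandwidth is estimated from the data; the quantile and n_samples arguments
# mirror the two bandwidth parameters described in Section 3.6
bandwidth = estimate_bandwidth(points, quantile=0.2, n_samples=len(points))
labels = MeanShift(bandwidth=bandwidth).fit_predict(points)
```

Each resulting label corresponds to one density mode, i.e., one candidate single tree; points in the sparse region between crowns are assigned to the nearest mode.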
3.5. Evaluation Indexes
To better evaluate the performance of the model, this paper uses different evaluation metrics for two parts: semantic segmentation of the tree point cloud and clustering extraction of the tree point cloud. For the semantic segmentation accuracy of the tree point cloud, Accuracy, Precision, Recall, and F1-score were utilized as evaluation metrics to assess the semantic segmentation results and compare different models. The calculation of each evaluation metric is presented in Equations (1)–(4).
In the equations: TP indicates the number of points that were originally labeled tree points and were correctly predicted as tree points; TN indicates the number of points that were originally ground points (points other than tree points) and were correctly predicted as ground points; FP indicates the number of points that were originally ground points but were incorrectly predicted as tree points; and FN indicates the number of points that were originally labeled tree points but were incorrectly predicted as ground points.
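Assuming the standard formulas for these four metrics, Equations (1)–(4) can be written directly in code:

```python
def accuracy(tp, tn, fp, fn):
    """Eq. (1): fraction of all points predicted correctly."""
    return (tp + tn) / (tp + tn + fp + fn)

def precision(tp, fp):
    """Eq. (2): fraction of predicted tree points that are truly tree points."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Eq. (3): fraction of true tree points that were found."""
    return tp / (tp + fn)

def f1_score(tp, fp, fn):
    """Eq. (4): harmonic mean of Precision and Recall."""
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r)

# Example: 90 tree points found, 20 missed, 10 false alarms, 80 ground points kept
acc = accuracy(90, 80, 10, 20)   # 0.85
f1 = f1_score(90, 10, 20)
```

F1-score balances Precision and Recall, which is useful here because tree and background points are not equally represented in the scenes.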
For the clustering extraction of the tree point cloud, Precision was selected as the main evaluation metric, reflecting the completeness of single-tree point cloud extraction; it was calculated as shown in Equation (2). Meanwhile, Average Precision (AP) was calculated from the per-tree Precision values, and the Correct Quantity, Recognition Rate, Missing Quantity, and Missing Rate were calculated according to Equations (5)–(8).
In the equations: samples with a Precision greater than or equal to 50% were counted as correctly segmented samples (Correct Quantity); samples with a Precision less than 50% were counted as incorrectly segmented samples (Incorrect Quantity); and samples with a Precision less than 5% were counted as missed samples, which were used to calculate the Missing Quantity and the Missing Rate.
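The thresholding described above can be expressed as a short helper (a sketch of Equations (5)–(8) under the stated 50%/5% thresholds; the function and key names are ours):

```python
def summarize_extraction(per_tree_precision):
    """Summarize single-tree extraction from per-tree Precision values:
    >= 50% counts as correct, < 50% as incorrect, < 5% as missed."""
    total = len(per_tree_precision)
    correct = sum(p >= 0.5 for p in per_tree_precision)
    missed = sum(p < 0.05 for p in per_tree_precision)
    return {
        "AP": sum(per_tree_precision) / total,   # Average Precision over all trees
        "correct_quantity": correct,
        "recognition_rate": correct / total,
        "missing_quantity": missed,
        "missing_rate": missed / total,
    }

# Five extracted trees with hypothetical per-tree Precision values
stats = summarize_extraction([0.95, 0.88, 0.62, 0.40, 0.03])
```

Note that a missed sample (Precision < 5%) also counts toward the incorrect samples, since 5% < 50%.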
3.6. Experimental Environment and Parameters
The experimental setup in this paper used an Intel Core i7-12700K CPU, 32 GB RAM, and an NVIDIA RTX 3080 12 GB graphics card. The deep learning environment included CUDA 11.3, Python 3.6, and TensorFlow-GPU 2.6.0. The experimental parameters were set as follows: in the semantic segmentation part, all parameters were uniform, with a nearest-neighbor value (K) of 16, a single input size of 40,960 points, an initial learning rate of 0.01, a momentum of 0.95, and 100 model iterations. In the clustering extraction part, the point cloud denoising parameters were neighbors = 400 and ratio = 3, and the main Meanshift parameter, the bandwidth, was calculated from the quantile and samples parameters. The quantile was uniformly 0.0115, while samples was affected by point cloud density: samples = 1800 in Areas 1 and 2, and samples = 400 in Area 3.
5. Discussion
5.1. Evaluation of Our Approach
This paper proposes a novel method for extracting point clouds of single trees of a specified species from RGB point clouds. The method uses the deep point cloud semantic segmentation network Improved-RandLA-Net, which can recognize tree species well and extract the specified species’ point clouds. Then, point cloud clustering is used to segment the extraction results into single-tree point clouds. The proposed method achieved good segmentation results in three different experimental areas, indicating its feasibility and high robustness for single-tree point cloud extraction.
However, there are still some limitations in this work. For example, the datasets used in this paper are manually labeled by researchers, which is a very subjective process. Although most of the point labels in tree annotation are correct, there are still some errors in the areas where the trunk connects to the ground and where the tree crowns overlap. This can only be addressed by combining visual interpretation by researchers with the ortho-image.
5.2. Improved-RandLA-Net and RandLA-Net
This paper presents an improved version of the point cloud semantic segmentation network RandLA-Net, known as Improved-RandLA-Net. The study demonstrates its superior performance compared to the original network in the three experimental areas discussed in the paper. Notably, a significant improvement is observed in Area 2, the Lin’an Pecan sample site. Unlike the other two relatively flat areas, Area 2 exhibits substantial variations in topography, posing challenges for semantic segmentation. In certain instances, tree points and ground points align horizontally, making them difficult to distinguish. Additionally, dense grass covering the ground further complicates the segmentation task, as the color of the ground closely resembles that of the trees. Consequently, the original network fails to achieve satisfactory semantic segmentation results. In contrast, Improved-RandLA-Net enhances the extraction of channel and spatial features by incorporating the Attentive Chain mechanism, effectively capturing local point cloud features and improving segmentation accuracy.
5.3. Compared with Traditional Methods
Currently, most research on single-tree segmentation of point cloud data focuses on LiDAR-acquired data without color information. The main processing steps include ground point removal, point cloud denoising, and single-tree segmentation. Common ground point removal methods, such as the RANSAC algorithm [42] and Progressive TIN Densification [43], perform well on relatively flat terrain but have limitations in areas with significant elevation differences. These methods struggle to preserve ground points on steep slopes or cliffs and may incorrectly identify other points as ground points, leading to inaccurate single-tree segmentation. Moreover, using point cloud clustering alone to extract single trees in complex terrain often yields unsatisfactory results. For example, in the experimental areas of this study, the results extracted by clustering alone were poor and therefore could not be effectively analyzed.
To address these limitations, this paper proposes a novel approach that leverages RGB point cloud data and a deep point cloud semantic segmentation network to better learn the local features of tree point clouds. This approach effectively classifies and extracts tree point clouds, achieving a tree point cloud extraction rate of over 90% across three experimental areas. Additionally, the paper proposes a new single-tree point cloud segmentation method that combines point cloud clustering to solve some of the problems associated with traditional methods.
5.4. Future Research Directions
In future work, this project will focus on the following key areas: (1) Measurement of single-tree canopy volume. Compared to commonly used metrics such as canopy width and area, canopy volume better reflects the ecological function of a tree. Building on this study, the calculation and validation of canopy volume can be achieved. (2) Merging the forest canopy point cloud and the forest floor point cloud into a complete point cloud, from which single-tree segmentation can yield more single-tree structural parameters. In reality, tree density can be high, and collecting point clouds from above the canopy alone may not provide a complete representation of the tree. To address this issue, we aim to synthesize complete scene point clouds by collecting data from both above and beneath the canopy. These point clouds will be used to perform single-tree point cloud segmentation of multiple tree species and extract their respective structural parameters. (3) Classification of multiple tree species. The method employed in this study has been verified to distinguish trees from nontrees. Further verification of its ability to differentiate between tree species would greatly advance the application of RGB point clouds.