Single-Tree Detection in High-Resolution Remote-Sensing Images Based on a Cascade Neural Network

Abstract: Traditional single-tree detection methods usually need to set different thresholds and parameters manually according to different forest conditions. As a solution to the complicated detection process for non-professionals, this paper presents a single-tree detection method for high-resolution remote-sensing images based on a cascade neural network. In this method, we first calibrated the tree and non-tree samples in high-resolution remote-sensing images to train a classifier with the backpropagation (BP) neural network. Then, we analyzed the differences in the first-order statistic features, such as energy, entropy, mean, skewness, and kurtosis, of the tree and non-tree samples. Finally, we used these features to correct the BP neural network model and build a cascade neural network classifier to detect a single tree. To verify the validity and practicability of the proposed method, six forestlands, including two areas of oil palm in Thailand and four areas of small seedlings, red maples, or longan trees in China, were selected as test areas. The results from different methods, such as the region-growing method, template-matching method, BP neural network, and proposed cascade-neural-network method, were compared considering these test areas. The experimental results show that the single-tree detection method based on the cascade neural network exhibited the highest root mean square of the matching rate (RMS_R_mat = 90%) and matching score (RMS_M = 68) in all the considered test areas.


Introduction
Reliable information concerning a forest is required to perform extensive forest management, as well as for planning purposes to maintain sustainable forestry. With the increasing availability of high-spatial-resolution data and computational power, a growing amount of remote-sensing research on forestry has focused on detecting and measuring individual trees as opposed to obtaining stand-level statistics. High-resolution satellite remote-sensing imagery is currently one of the most widely used types of data in forestry applications [1]. Today, many remote-sensing satellites can obtain sub-meter remote-sensing images; these satellites include Orbview5, WorldView, and QuickBird-2 of the United States, EROS-B and EROS-C of Israel, and Gaofen-2 of China. The color and contour features of trees, which cannot be observed in low-resolution remote-sensing images, can be observed in high-resolution remote-sensing images.
Currently, there is widespread interest among many researchers regarding the detection of individual trees and gathering forest information from digital aerial photographs or high-resolution remote-sensing images, and several researchers proposed automatic or semi-automatic single-tree detection methods. The conventional methods of tree detection can be mainly divided into two categories.
The first category involves tree detection based on pixels. For example, the local-maximum method [2][3][4] extracts the maximum value of a local area as the center point of a tree. In addition, the algorithm combines region growing, watershed segmentation, and other methods to detect a single tree. Novotný and Hanuš et al. [5] proposed a method of local maxima with variable window sizes, and they used seeded region-growing methods to detect individual trees. Hirschmugl et al. [6] first compared different methods of obtaining the crown center, and then proposed a deformation algorithm to determine crown centers. However, the methods based on the local maximum cannot make full use of the overall characteristics of the tree. Therefore, the selection of seed points in a complex background significantly affects the accuracy. The template-matching algorithm [7][8][9] slides a tree template over the image and computes, in sequence, the squared error between the template and each image window of the same size. However, the template-matching method is not suitable for areas in which the trees are crowded and the canopies often overlap, because many trees cannot be detected.
The other category corresponds to object-based tree-detection methods, which gradually incorporate machine-learning algorithms. For example, Salim Malek et al. [10] extracted a set of candidate key points of a palm farm using the scale-invariant feature transform (SIFT) and analyzed these key points with a recent kernel-based classification method termed extreme learning machine (ELM). However, SIFT selects the characteristics of several key points in the sample, and it is less competitive than the method based on the global features of the red, green, and blue channels. In addition, Lin Yang et al. [11] trained a pixel-level classifier for each pixel in the aerial image based on a set of visual features, and introduced methods for model and data selection based on two-level clustering. However, these methods require manually setting a large number of parameters for different scenes, which is extremely difficult without prior knowledge.
Overall, there are some problems with the existing single-tree detection methods. Firstly, these methods have a strong parameter dependency. They usually need experts to set various parameters in advance according to the forestlands. Secondly, the detection performance differs greatly across different forest types, and the generalization ability of these methods is weak. For example, the region-growing method can obtain the best detection results in the case of mixed and dense forests, but the detection results in the case of an isolated forest are much worse than those by other detection methods.
The detection of individual trees in high-resolution remote-sensing images is typically a target-recognition problem. Owing to the advantages of the cascade neural networks [12,13], such as a strong ability in nonlinear mapping, fast convergence, and good fault tolerance, these networks achieved great success in dealing with image-identification problems. To reduce the dependence of the detection method on prior knowledge and improve the generalization performance of the classification model for different scenes, this paper presents a single-tree detection method for high-resolution remote-sensing images based on a cascade neural network. Unlike the methods based on a pixel analysis, the proposed method based on a set of pixels can take the overall characteristics of trees into account. Firstly, this model calibrated many tree and non-tree samples in high-resolution remote-sensing images and used these samples to train the classic backpropagation (BP) neural network model [14][15][16]. The neural network at the first stage can perform the nonlinear characterization of tree features and provide a preliminary classifier. To further improve the accuracy of the classifier, we analyzed the statistical characteristics of trees and designed a BP neural network in the second stage. The second network input layer includes both the output of the first BP network and the statistical characteristics of the tree samples on the three RGB channels.

Materials
Google Earth is a virtual-earth software application that renders a simulacrum of the Earth based on satellite imagery. In addition, Google Earth images are easily available, and this can have large implications for forest management and land applications. The remote-sensing images used in this study are WorldView pan-sharpened imagery. They involve red, green, and blue channels, with a resolution of 0.31 m. Because of concerns regarding transferability, we processed the satellite-derived images directly from Google Earth without radiometric (radiance and reflectance) or sun-glint corrections.
For conducting the comparative experiment to prove the effectiveness of the proposed method on single-tree detection, we chose six uncalibrated forest areas as the test areas. Because field measurement data for these test areas cannot be obtained, we considered the artificial visual interpretation of six volunteers to calibrate the trees as the ground truth.
The forestlands considered in this study are located in China and Thailand. Figure 1 shows the satellite imagery and corresponding reference data of all six test areas; here, the test area in every image is delimited by a yellow line, as the trees outside the yellow line are difficult to assess by eye. The latitude and longitude coordinates of these six test areas are shown in Figure 1. The left column of Figure 1 is the RGB image, and the right column of Figure 1 is the corresponding reference data.
Test areas 1 and 2 are located in Thailand, and the main tree species in these two areas is oil palm, an important economic crop in Thailand. There are 801 and 179 reference trees in test areas 1 and 2, respectively. Test area 3 is located in Hangzhou, China. This area is relatively complicated. There are not only different trees, but also rivers, buildings, and lawns in this area. There are 312 reference trees in test area 3. Test area 4 is located in Shaoxing, China. This area mainly consists of some small seedlings, and the trees are common varieties. There are 338 reference trees in this area. Test area 5 is located in Hangzhou, China, and mainly consists of red maples. The photo was taken in autumn; thus, the leaves were red. There are 341 reference trees in test area 5. Test area 6 is located in Dongguan, China, and the main tree species of this area is longan. The forest density in this area is relatively high; thus, the tree crowns overlap each other. There are 521 reference trees in test area 6.

Method
The flowchart of the single-tree detection method for high-resolution remote-sensing images based on a cascade neural network is shown in Figure 2. Firstly, we selected different types of forestlands from the high-resolution remote-sensing images and calibrated the representative tree and non-tree samples for different forest types. Secondly, we normalized the samples to the same size to ensure a uniform input layer size of the neural network, and we calculated the first-order statistical features of the samples, such as the energy, entropy, mean, skewness, and kurtosis. Finally, the neural-network model was trained with these samples and features, until the errors in the desired output and the actual output met the requirements. After neural-network learning, the trained neural-network model could be adopted as a classifier to detect single trees for different forests.

Sample Calibration
To obtain a classifier that can accurately distinguish between a tree and non-tree, the neural network must be trained using manually calibrated tree and non-tree samples. Because there are many types of trees in the remote-sensing images, such as isolated trees, touching trees, larger trees, smaller trees, and trees under shadows, each tree type must be considered. Therefore, sample calibration requires a large number of positive (tree) and negative (non-tree) samples. In our study, a total of 849 positive samples and 848 negative samples were calibrated. An example of positive and negative sample calibrations is shown in Figure 3. Manual calibration sampling takes a large amount of time and effort. However, more samples can achieve better training results and stronger generalization ability of the neural-network model; thus, it was necessary to expand the limited number of manually calibrated samples. Therefore, we used data augmentation [17,18] to increase the number of samples. The calibration samples were manipulated using left-right mirroring and rotations of 15 degrees to the left and to the right; thus, the number of positive and negative samples was four times the number of original samples. Finally, we obtained 3396 positive and 3392 negative samples. An example of the sample extension is shown in Figure 4. To unify the number of neurons in the input layer, all positive and negative samples were resized to the same size. The specified size is usually the average size of the individual trees in the forestlands; for example, the value was 25 × 25 pixels for our experiment.
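The fourfold expansion described above can be sketched as follows. This is a minimal illustration, assuming that the four variants are the original sample, its left-right mirror, and two rotations of ±15 degrees; the function names, the nearest-neighbour sampling, and the zero fill for pixels rotated in from outside the patch are assumptions made for this sketch and are not taken from the paper.

```python
import math

def mirror(img):
    """Left-right mirror of a 2-D image stored as a list of rows."""
    return [row[::-1] for row in img]

def rotate(img, degrees, fill=0):
    """Rotate about the image centre, keeping the input size.

    Uses inverse mapping with nearest-neighbour sampling (an assumption);
    output pixels whose source falls outside the image receive `fill`.
    """
    h, w = len(img), len(img[0])
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    t = math.radians(degrees)
    cos_t, sin_t = math.cos(t), math.sin(t)
    out = [[fill] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            dx, dy = x - cx, y - cy
            # Source coordinates that map onto output pixel (x, y).
            sx = cos_t * dx + sin_t * dy + cx
            sy = -sin_t * dx + cos_t * dy + cy
            ix, iy = int(round(sx)), int(round(sy))
            if 0 <= ix < w and 0 <= iy < h:
                out[y][x] = img[iy][ix]
    return out

def augment(img):
    """Return the assumed four variants of one calibrated sample."""
    return [img, mirror(img), rotate(img, 15), rotate(img, -15)]
```

With four variants per sample, 849 positive samples yield 3396 and 848 negative samples yield 3392, which agrees with the totals reported above.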
After normalizing the individual tree samples, we divided the samples into three separate sets for the neural-network training: 50% of the samples were the training set, 25% were the validation set, and the remaining 25% were the test set. The training set was used to train the model, the validation set was used to determine the final parameters of the control network, and the test set was used to evaluate the single-tree detection method.
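The 50%/25%/25% partition can be sketched as below. The random shuffle and the fixed seed are assumptions added for reproducibility; the paper does not state how samples were assigned to each set, and `split_samples` is a hypothetical helper name.

```python
import random

def split_samples(samples, seed=0):
    """Split samples into training (50%), validation (25%), and test (rest)."""
    items = list(samples)
    # Shuffling before the split is an assumption of this sketch.
    random.Random(seed).shuffle(items)
    n_train = len(items) // 2
    n_val = len(items) // 4
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]
    return train, val, test
```

Applied to the 6788 augmented samples (3396 positive plus 3392 negative), this gives 3394 training, 1697 validation, and 1697 test samples.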

Training Samples Using the BP Neural Network at the First Stage
The BP neural network is generally a three-layer network, involving the input layer, hidden layer, and output layer. In our method, the sigmoid function, $f(x) = 1/(1 + e^{-x})$, was chosen as the activation function of the neural network. The activation function maps the combination of neurons and bias non-linearly to enhance the expressiveness of the network. Firstly, the feed-forward transmission is adopted to reconstruct the network and update the parameters. The aim of feed-forward transmission is to achieve the representation of the original data of the input layer as much as possible. The error of the output's direct front layer can be estimated using a backpropagation algorithm in the process. Then, the error can be used to estimate the further layer and achieve the error estimation for other layers by sequentially performing backpropagation. The feedback error is used to update weights.
In the training of a BP neural network, it is necessary to firstly determine the number of neurons in each layer of the neural network, as shown in Figure 5. The sample image in this method is a single patch of 25 × 25 pixels of grayscale image, and, since the input layer needs a bias, the number of neurons in the input layer is 626. Since the identification of a tree is a binary classification problem, the number of neurons in the output layer is only one. If the output has a value of zero, a tree does not exist in the input image; otherwise, a tree is present in the input image. After determining the number of neurons for the input and output layers, the number of neurons in the hidden layer must be determined. To determine the number of neurons in the network hidden layer, we performed an experiment concerning the different numbers of neurons. In the field of machine learning, and specifically, in statistical-classification problems, a confusion matrix [19], also known as an error matrix, is a specific table layout that allows visualization of the performance of an algorithm. We split all samples into a training set, validation set, and test set. Accordingly, for the different network structures, we could obtain the confusion matrices of the training set, validation set, test set, and all sample datasets to evaluate the classification capabilities of the network. We selected different numbers of neurons for the hidden layer, such as 150, 300, and 450 neurons, which represent intermediate values between the number of neurons in the input and output layers.
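A three-layer BP network of the kind described here can be sketched in a few lines. This is an illustrative sketch only: the squared-error loss, the learning rate, the weight-initialisation range, and the class name `BPNetwork` are assumptions, and the bias is handled by appending a constant 1 to the 625 pixel values so that each hidden neuron has 626 incoming weights.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

class BPNetwork:
    """Input layer -> one sigmoid hidden layer -> one sigmoid output neuron."""

    def __init__(self, n_in, n_hidden, seed=0):
        rng = random.Random(seed)
        # One row of weights per hidden neuron; the last entry is the bias weight.
        self.w1 = [[rng.uniform(-0.1, 0.1) for _ in range(n_in + 1)]
                   for _ in range(n_hidden)]
        self.w2 = [rng.uniform(-0.1, 0.1) for _ in range(n_hidden + 1)]

    def forward(self, x):
        xb = list(x) + [1.0]  # append the bias input
        h = [sigmoid(sum(w * v for w, v in zip(row, xb))) for row in self.w1]
        hb = h + [1.0]        # append the bias for the output neuron
        y = sigmoid(sum(w * v for w, v in zip(self.w2, hb)))
        return xb, hb, y

    def predict(self, x):
        return self.forward(x)[2]

    def train_step(self, x, target, lr=0.5):
        """One backpropagation update for the loss 0.5 * (y - target)^2."""
        xb, hb, y = self.forward(x)
        delta_out = (y - target) * y * (1.0 - y)
        # Hidden-layer errors are propagated back through the old output weights.
        delta_hid = [delta_out * self.w2[j] * hb[j] * (1.0 - hb[j])
                     for j in range(len(self.w1))]
        for j in range(len(self.w2)):
            self.w2[j] -= lr * delta_out * hb[j]
        for j, row in enumerate(self.w1):
            for i in range(len(row)):
                row[i] -= lr * delta_hid[j] * xb[i]
        return 0.5 * (y - target) ** 2
```

For example, `BPNetwork(625, 300)` corresponds to the 25 × 25 grayscale input with 300 hidden neurons, one of the hidden-layer sizes compared in the experiment.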

Calculating First-Order Statistics to Train Samples at the Second Stage
In the first step of training, we used the grayscale image to quickly distinguish the trees from the background without much RGB information. To further improve the recognition rate and reduce the omission rate, the proposed approach combines the training results of a previous BP neural network with the first-order statistics of individual trees as the input for another BP neural network. Since the grayscale image is synthesized by the three RGB channels, some information is lost. The first-order statistics from each color band can improve the recognition rate and reduce the omission rate. The features of first-order statistics, including energy, entropy, mean, skewness, and kurtosis of the trees in samples help determine whether the object is a tree. In this study, we adopted the following first-order statistics as features, where N is the largest grayscale, n is one of the grayscales, and H(n) is a normalized histogram:
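As an illustration, the five features can be computed from a normalized histogram as in the sketch below. It assumes the commonly used textbook definitions: mean as the sum of n·H(n), energy as the sum of H(n)², entropy as the negative sum of H(n)·log2 H(n), and skewness and kurtosis as the third and fourth central moments divided by σ³ and σ⁴. The exact normalization used by the authors may differ (for instance, some definitions subtract 3 from the kurtosis), and the function names are hypothetical.

```python
import math

def first_order_stats(pixels, n_max=255):
    """First-order statistics of one band, given integer pixel values in 0..n_max."""
    counts = [0] * (n_max + 1)
    for p in pixels:
        counts[p] += 1
    total = float(len(pixels))
    hist = [c / total for c in counts]  # normalized histogram H(n)

    mean = sum(n * h for n, h in enumerate(hist))
    energy = sum(h * h for h in hist)
    entropy = -sum(h * math.log2(h) for h in hist if h > 0)
    var = sum((n - mean) ** 2 * h for n, h in enumerate(hist))
    if var > 0:
        std = math.sqrt(var)
        skewness = sum((n - mean) ** 3 * h for n, h in enumerate(hist)) / std ** 3
        kurtosis = sum((n - mean) ** 4 * h for n, h in enumerate(hist)) / std ** 4
    else:
        # Constant patch: higher moments are undefined; 0 is used by convention here.
        skewness = kurtosis = 0.0
    return {"energy": energy, "entropy": entropy, "mean": mean,
            "skewness": skewness, "kurtosis": kurtosis}

def rgb_features(red, green, blue):
    """Concatenate the five statistics of each band into a 15-value list."""
    order = ("energy", "entropy", "mean", "skewness", "kurtosis")
    feats = []
    for band in (red, green, blue):
        stats = first_order_stats(band)
        feats.extend(stats[name] for name in order)
    return feats
```

Each band is passed as a flat sequence of pixel values, so a 25 × 25 patch contributes 625 values per band and 15 features in total.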

Training Samples Using a Cascade Neural Network
This constitutes a two-level cascade neural network to improve the results in single-tree detection for different scenarios.
The number of neurons in the input layer of the second neural network is 17. There are 15 neurons for the five features of energy, entropy, mean value, skewness, and kurtosis in the three RGB bands in a remote-sensing image, one is for the output value of the previous BP neural network, and another is for the bias. The output layer is also used for determining whether the input image involves a tree; thus, only one output neuron is required. The number of neurons in the hidden layer is half of the number of input neurons, specifically, eight neurons. The structure of the cascaded neural network is shown in Figure 6. The gray box in the middle dotted-line box indicates that the first BP network uses the gray information of the samples, and the RGB boxes corresponding to the first-order statistical features indicate that the three channels of RGB information were extracted. In the second level of the network, we mixed the grayscale and RGB information. The cascade-neural-network model is composed of two three-layer BP neural networks; thus, it is a 3-3-layer cascade-neural-network model. In this way, we can further improve the ability of the network to classify trees. In addition, the numbers of neurons and layers in the network are determined through grid search.
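The way the two stages fit together can be sketched as follows. The ordering of the 17 inputs, the 0.5 decision threshold, and the names `second_stage_input` and `cascade_predict` are assumptions made for illustration; the two networks are passed in as plain callables so that the sketch does not depend on a particular implementation.

```python
def second_stage_input(first_stage_output, band_features):
    """Build the 17-value input: 15 statistics + first-stage output + bias.

    band_features holds three lists (R, G, B) of five statistics each.
    """
    vec = [v for band in band_features for v in band]
    if len(vec) != 15:
        raise ValueError("expected 3 bands x 5 statistics")
    return vec + [first_stage_output, 1.0]

def cascade_predict(gray_patch, band_features, first_net, second_net, threshold=0.5):
    """Run both stages; returns (second-stage output, is_tree)."""
    p1 = first_net(gray_patch)                     # stage 1: grayscale patch
    p2 = second_net(second_stage_input(p1, band_features))  # stage 2: fused input
    return p2, p2 >= threshold
```

With 17 inputs, the hidden layer of eight neurons corresponds to the integer half of the input size.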

Sliding Window and Redundant Sample Removal
To verify the validity of the proposed method, we used the trained classifier to traverse each image through a sliding window to find single trees, where the sliding window size varied from 17 × 17 to 33 × 33, according to the tree size in the study areas. If the image in the window was a tree, we marked the window with a label (saving the location and size) to represent tree detection. After traversing the entire image, several labels indicating potential trees are obtained; however, many of them represent the same tree; thus, non-maximum suppression (NMS) is used to remove the redundant labels. Firstly, all labels are sorted by their classification probability from high to low. Subsequently, we extract the top label, label_now, in the sequence as a detected tree, and remove all the labels in the sequence for which the overlap areas with label_now are greater than a threshold. Next, we repeatedly extract the top label, label_now, in the sequence and repeat the above process until the sequence is empty. An example of NMS is shown in Figure 8.
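The traversal and the NMS loop described above can be sketched as follows. The overlap measure (intersection over union), the threshold of 0.3, the window stride, and the intermediate window size of 25 are assumptions of this sketch; the text only states that labels whose overlap with label_now exceeds a threshold are removed.

```python
def sliding_windows(width, height, sizes=(17, 25, 33), stride=4):
    """Yield square windows (x, y, w, h) that fit inside the image."""
    for size in sizes:
        for y in range(0, height - size + 1, stride):
            for x in range(0, width - size + 1, stride):
                yield (x, y, size, size)

def iou(a, b):
    """Intersection over union of two boxes given as (x, y, w, h, ...)."""
    ax, ay, aw, ah = a[:4]
    bx, by, bw, bh = b[:4]
    iw = min(ax + aw, bx + bw) - max(ax, bx)
    ih = min(ay + ah, by + bh) - max(ay, by)
    if iw <= 0 or ih <= 0:
        return 0.0
    inter = iw * ih
    return inter / float(aw * ah + bw * bh - inter)

def non_max_suppression(labels, threshold=0.3):
    """labels: (x, y, w, h, probability). Keeps the best label per tree."""
    remaining = sorted(labels, key=lambda lab: lab[4], reverse=True)
    kept = []
    while remaining:
        label_now = remaining.pop(0)  # highest remaining probability
        kept.append(label_now)
        # Drop every label that overlaps label_now by more than the threshold.
        remaining = [lab for lab in remaining if iou(label_now, lab) <= threshold]
    return kept
```

In use, each window returned by `sliding_windows` would be resized to the network input size and classified, and the windows classified as trees would be passed, together with their probabilities, to `non_max_suppression`.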

Accuracy Evaluation Method
To verify the effectiveness of the proposed single-tree detection method, it is necessary to evaluate the accuracy of the detection results according to the reference data. When the spatial-position difference between the detected tree in the detection result and a ground-truth tree is within a certain range, we can say that the detected tree matches the ground-truth tree, that is, the detected tree is a correct result. A detailed evaluation of the accuracy involves three steps [20]: (1) Candidate tree selection: For a detected tree, the reference trees are added to the candidate set if the horizontal difference ΔD_2D is within a certain threshold range. We set ΔD_2D < 3 m in test areas 1, 2, 3, and 4, and ΔD_2D < 4 m in test areas 5 and 6. (2) Selection of the best candidate tree: We determine the nearest reference tree to the test tree from the candidate set as the best candidate tree. (3) Candidate testing: The matching problem is not a one-way problem. A test tree needs to find the best-matching reference tree. The reference tree also needs to find the best-matching test tree. These two trees are considered as a successful match only when the candidate tree of the test results and the candidate tree of reference data are candidate trees for each other. The accuracy evaluation parameters and calculation method are defined as follows: The root mean square (RMS) of X is defined as the formula below, where X can be A_mean, R_extr, R_mat, R_com, R_om, and M; i denotes the i-th test area.
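The three matching steps and the RMS aggregation can be sketched as below. Tree positions are assumed to be (x, y) coordinates in metres, and the RMS is assumed to take the usual form, the square root of the mean of X_i² over the test areas; the function names are hypothetical, and the individual rates (R_extr, R_mat, R_com, R_om) are not reconstructed here because their formulas are not given in this text.

```python
import math

def match_trees(detected, reference, max_dist):
    """Return (detected_index, reference_index) pairs that are mutual best matches.

    A pair is accepted only if each tree is the other's nearest neighbour
    and their horizontal distance is below max_dist.
    """
    def nearest(point, candidates):
        best, best_d = None, None
        for j, q in enumerate(candidates):
            d = math.hypot(point[0] - q[0], point[1] - q[1])
            if d < max_dist and (best_d is None or d < best_d):
                best, best_d = j, d
        return best

    det_to_ref = [nearest(p, reference) for p in detected]   # steps 1 and 2
    ref_to_det = [nearest(q, detected) for q in reference]
    # Step 3: keep a pair only when the choice is reciprocal.
    return [(i, j) for i, j in enumerate(det_to_ref)
            if j is not None and ref_to_det[j] == i]

def rms(values):
    """Root mean square of one metric across the test areas."""
    return math.sqrt(sum(v * v for v in values) / len(values))
```

For test areas 1 to 4, `max_dist` would be 3, and for test areas 5 and 6 it would be 4, following the thresholds stated above.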

The Comparison Results of Different Neurons
The test set is used to measure the generalization performance and classification ability of the optimal model. Therefore, we usually focus on the blue cells of the test-set confusion matrix for each neural-network model. According to the training results, when the number of hidden neurons was 150, 300, and 450, the overall accuracy rate of the trained model for the test set was 90%, 94%, and 90.5%, respectively (Figures 9-11). Thus, the model with 300 hidden neurons achieved the best results, and our neural-network model therefore has 300 hidden neurons. Specifically, the accuracy rate of the model for the training set is 95.2%, and the accuracy rate for the testing set is 94%. In other words, the results for the training set and the testing set exhibit no significant difference, which suggests that our BP neural-network model generalizes well to samples it was not trained on.

The Comparison Results of Different Layers
The training results of the 3-3-layer cascaded network are shown in Figure 12. From the training results, the accuracy of all the samples using the model was 97%. The accuracy rate of the training set was 97%, and the accuracy rate of the testing set was 97.2%. The results exhibit no significant difference; thus, our training model generalizes well. The training results of the 3-4-layer cascade-neural-network model (Figure 13) showed that, in each of the three considered scenarios for the training set, validation set, and testing set, the 3-3-layer cascade-neural-network model achieved better results than the 3-4-layer cascade-neural-network model, but the difference was very small. This indicates that these two network models achieved nearly comparable results. Therefore, the 3-3-layer cascade-neural-network model (shown in Figure 6) was adopted to detect individual trees for different forests.
Figure 14 shows the values of the energy, entropy, mean, skewness, and kurtosis for the positive and negative samples in the red band of the images. It can be seen from the figure that the values of energy, entropy, and kurtosis of positive samples were more stable than those of negative samples, and the values were generally smaller than those of negative samples. The fluctuation ranges of the mean and skewness of the positive samples were smaller than those of the negative samples.
In our single-tree detection method, a list of rectangular areas is provided, and each rectangular area in the list represents an individual tree. Figure 15 shows the detection results of the six test areas, where the red rectangle represents a detected tree. To measure the difference between detection results and the ground truth, the area of the tree crown must be estimated. Since the tree crown is approximately round in the remote-sensing images, it is easy to calculate the diameter of the crown. We took the smaller of the length and width of the rectangle for each detected tree as its crown diameter.

Detection Results of Each Test Area
Tables 1-6 present the comparison of detection results using different methods, where the region-growing method, template-matching method, BP neural network, and the cascade neural network were adopted to detect individual trees in six different test areas. This section compares the results of N_test, N_match, R_extr, R_mat, R_com, R_om, A_mean, and M, which have been defined in Section 3.6. In the region-growing method, W represents the size of the sliding window. In the template-matching method, T represents the similarity threshold between the template and the detection tree. These two parameters are very important because W and T significantly influence the detection results of the region-growing and template-matching methods, respectively. If W is too large, then only local maximum points are selected as seed points in the region-growing method; thus, a large number of small trees will be left out, and the omission rate will be very high. In contrast, if W is too small, the same tree may be repeatedly detected and the computation cost may be very high.
In test areas 1, 2, and 4, the forestland was a large area of planted forest with similar and widely spaced tree species. The data in Tables 1, 2 and 4 show that the proposed method had a better detection score than the other three methods, especially in Table 4 (M = 77). In Table 3, the considered forestland had a complex area, involving trees, buildings, water, and other objects; thus, the detection results of these methods were not satisfactory in test area 3. The region-growing method achieved its best detection result when the size of the sliding window W was 5. The detection result shows that the commission rate of the region-growing algorithm is high. The main reason is that the region-growing algorithm cannot effectively avoid the interference of a complex background, and several incorrect seed points are selected, thus resulting in low precision. Even in such a scenario, the cascade-neural-network method can still get the highest matching rate (R_mat = 82%) and score (M = 52). Therefore, the proposed method has a strong anti-interference ability. However, in Tables 5 and 6, the difference in the detection results of the four methods is not obvious because the trees were similar and the background was uniform in test areas 5 and 6. Even in such a relatively simple scenario, the proposed method had better tree-detection performance. Table 7 presents the overall detection results in the six considered areas. For the region-growing method, the extraction rate was 149%, but the matching rate was low at only 83%. This led to the highest commission rate among these four methods. The template-matching method achieved the lowest extraction rate, matching rate, and commission rate. Its low matching rate led to the highest omission rate of 24%. The cascade neural network achieved a higher detection score (RMS_M = 68) than the other three methods.
Thus, the cascade-neural-network method can achieve the best tree-detection result for different types of forests. According to the above detection results in the six test forestlands, the region-growing method is generally suitable for trees that have clear boundaries. It is not only able to determine the location and the size of a single tree, but it is also able to depict the outline of the tree crown. The template-matching method is more suitable for applications in a complex forest, because it is less affected by the surrounding environment. Overall, the region-growing method achieved a good matching rate, but the detection score of this method was much lower than those of the other methods. The reason for this lies in the difficulty in selecting the local maximum as the center of trees in the test area. The template-matching method is highly dependent on the quality of the template; thus, the detection effect of different regions may vary greatly. For example, the score of test area 3 was high, but the scores of other regions were very low. The BP neural-network method and cascade neural network exhibit a better performance in all six regions. In particular, in a dense forest, the cascade-neural-network method demonstrated the best performance and achieved the best detection score among all the test areas. In addition, for scenarios with different levels of complexity, our method achieves better detection results, which indicates that our method has better generalization ability.

Conclusions
High-resolution remote-sensing images are widely used in high-precision forest resource surveys, forest management, timber production estimations, and other applications. Currently, researchers have developed various methods to extract individual trees and their characteristics in digital aerial photographs of various types. However, the existing single-tree detection methods are heavily dependent on certain features determined by prior knowledge. Furthermore, these methods cannot be applied in forest scenes with different complexities. The model needs to learn the overall characteristics of the trees in the training process, so that the trees of interest are best isolated.
To automatically and effectively identify individual trees, this paper presents a single-tree detection method for high-resolution remote-sensing images based on a cascade BP neural network. To improve the recognition rate and reduce the omission rate of the single-tree detection method, we introduced first-order statistical features of samples as supplementary features of individual trees and combined these with the BP neural network model to build a cascade-neural-network model. The experimental results show that the single-tree detection method for high-resolution remote-sensing images based on the BP cascade neural network proposed in this paper achieves better detection results than the existing methods, obtaining the highest matching rate and detection score. A BP cascade neural network does not need tree features to be extracted manually for different scenes, and the neural network automatically learns how to represent the most essential features of trees. Although the detection results of various methods are not ideal in complex scenarios, our method still maintains a good detection effect. Therefore, our method has better generalization performance for different scenarios.
The BP neural network is superior in many aspects to the methods based on artificial rules. However, it is still a shallow-learning model with only a single hidden layer of nodes, which requires a large number of neurons in the hidden layer. In recent years, object-detection methods based on convolutional neural networks (CNNs) have become a research hotspot, mainly because they have demonstrated excellent accuracy in the field of object recognition and image classification [22][23][24][25][26]. These methods are also used in the field of remote sensing. A CNN introduces receptive fields and weight-sharing mechanisms to reduce the number of parameters that the neural network is required to train. These methods based on deep learning will be the focus of future work.

Conflicts of Interest:
The authors declare no conflict of interest.