UAV-Based Terrain Modeling under Vegetation in the Chinese Loess Plateau: A Deep Learning and Terrain Correction Ensemble Framework

Accurate topographic mapping is critical for many environmental applications because elevation affects hydrodynamics and vegetation distribution. UAV photogrammetry is popular in terrain modeling because of its lower cost compared to laser scanning. However, the method is restricted in vegetated areas with complex terrain, owing to reduced ground visibility and the lack of robust, automatic filtering algorithms. To solve this problem, this work proposed an ensemble method of deep learning and terrain correction. First, an image matching point cloud was generated by UAV photogrammetry. Second, vegetation points were identified with a U-Net deep learning network. Ground elevation was then corrected by estimating vegetation height to generate the digital terrain model (DTM). Two scenarios, namely, discrete and continuous vegetation areas, were considered. Vegetation points in discrete areas were directly removed and then interpolated, and terrain correction was applied to points in continuous areas. Case studies were conducted in three different landforms of the Loess Plateau of China. Accuracy assessment indicated that the overall accuracy of vegetation detection was 95.0% and that the mean square error (MSE) of the final DTM was 0.024 m.


Study Site
Three study areas, namely, Xining (SA1), Wangjiamao (SA2), and Wucheng (SA3), located in Qinghai, Shaanxi, and Shanxi, respectively, were selected in the Loess Plateau of China (Figure 1); they represent loess hill and gully, loess hill, and loess valley landforms, respectively. Wangjiamao and Wucheng cover complete catchments, whereas Xining covers a hillslope area. All three study areas have been covered with vegetation since the implementation of the 'Grain for Green' project (converting cropland to conservation area) in the late 1990s [43,44]. Vegetation in the three study areas varied in type and spatial distribution. The vegetation in Xining was planted for ecological protection on formerly cultivated land, with an average spacing of 2 m on the terraced slopes. In the Wucheng and Wangjiamao areas, the vegetation is more natural, although some cash crops, such as apples and jujubes (Chinese dates), are still planted, with a denser horizontal spacing of around 1 m on the slopes. The basic geographic information is listed in Table 1.


Unmanned Aerial Vehicle (UAV) and Global Navigation Satellite System (GNSS) Field Data Collection
Image matching point clouds from UAV photogrammetry were used as the inputs for terrain modeling. Optical aerial photographs were captured using a DJI Inspire 1 microdrone [45] mounted with a Zenmuse X5 digital camera system [46] (15 mm focal length, RGB color, and 4096 × 2160 resolution); the drone had a battery endurance of approximately 18 min and could resist wind speeds of up to 10 m/s. Detailed flight information is shown in Table 2. Pix4D Capture flight planner software was used to plan a round-trip flight line along the study areas and automatically collect images within the designed flight distance. All flights were completed between 10 a.m. and 2 p.m. to ensure that image quality was not influenced by shadows. Ground control points (GCPs) in WGS-84 were obtained with a Topcon HiperSR RTK GNSS unit [47] (10 mm horizontal and 15 mm vertical positioning accuracy) on a tripod, to ensure horizontal and vertical accuracy. Bundle adjustment was implemented in Pix4D Mapper software [48]. The point clouds were finally generated and interpolated into a grid digital surface model (DSM). Eight targets along the vegetation in Xining (SA1) were designated as check points (CPs) for the uncertainty assessment of the final terrain modeling results. These targets were 1-m-wide boards painted black and white in a diagonal chessboard pattern.

Deep Learning (DL)-Based Vegetation Detection
Most DL networks connect simple layers for data distillation: input information passes through layers of filters that increase the purity of the distillation until the desired result is achieved [49]. The convolutional neural network (CNN) is a representative deep neural network structure; it is a feed-forward network commonly used for object recognition, target detection, semantic segmentation, and related problems [50,51]. A typical CNN architecture, U-Net [52], was implemented for vegetation detection because of its effectiveness and simplicity. U-Net adopts the principle of gradient descent, propagating data forward and back-propagating errors to correct the parameter weights and biases [53]. Certain layers of the existing U-Net structure were changed and adjusted to the specific terrain modeling tasks.

Training Data Generation
DL is usually applied to datasets with a large amount of data, and convolutional neural networks are well suited to processing image data. A large number of image patches were therefore generated as input data for the U-Net model. Input patches were randomly cropped to ensure proper representation and to eliminate the influence of manual selection. Random coordinate points were expanded into patches of the desired image size. The crop extent was calibrated, and the crop operation fully utilized the cell size and projected coordinate information.
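The cropping step can be sketched as follows (a minimal NumPy illustration; the array shapes and the function name are ours, not the paper's actual tool):

```python
import numpy as np

def random_crop(raster, size=128, rng=None):
    """Return one random size x size patch from an (H, W, C) array.
    Hypothetical helper standing in for the paper's ArcGIS-based cropping."""
    rng = rng or np.random.default_rng()
    h, w = raster.shape[:2]
    top = int(rng.integers(0, h - size + 1))
    left = int(rng.integers(0, w - size + 1))
    return raster[top:top + size, left:left + size]

rng = np.random.default_rng(0)
scene = rng.random((512, 512, 4))   # stand-in for an (R, G, B, Z) grid
patches = [random_crop(scene, 128, rng) for _ in range(10)]
```

In the actual workflow the random coordinates would be drawn in projected map units and converted to cell indices using the raster's cell size.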
Data augmentation is the process of generating new training data from the nature of the images, without actually collecting new samples. Convolution operations are translation-invariant, and similarity transformations such as rotation and scaling of the vegetation data do not change its information characteristics. Here, similar data outside the sample area were provided to the model to ensure data diversity. Random similarity transformation, scale transformation, Gaussian blur, and image enhancement were performed on the cropped data, in which the rotation transformation matrix and the 2D Gaussian function were treated as follows (Equations (1) and (2)).

R(θ) = [ cos θ   −sin θ
         sin θ    cos θ ]                                  (1)

G(x, y) = (1 / (2πσ²)) · exp(−(x² + y²) / (2σ²))           (2)
where θ is the angle of rotation and σ is the variance. For the classification task, the training data were labeled with one-hot encoded categories, namely, 1 for vegetation and 0 for non-vegetation. Manual labeling was first performed based on the original point clouds. The RGB and additional elevation information of the manually labeled vegetation in the three study areas were then generated from the original image matching point clouds. All labels were divided into two groups for model training and validation; forty percent of the dataset was randomly sampled as training data. Since DL requires a large number of training samples, a tool was developed in ArcGIS Pro [54] using a Python script for multi-scale replication of the training samples. Finally, 10,000 samples of 4 dimensions (R, G, B, Z) with 128 × 128 cells were automatically generated.
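The two augmentation primitives of Equations (1) and (2) translate directly into code (a sketch; the kernel size and the sum-to-one normalization are our choices):

```python
import numpy as np

def rotation_matrix(theta):
    """2D rotation by angle theta, as in Equation (1)."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

def gaussian_kernel(size, sigma):
    """Discrete 2D Gaussian of Equation (2), normalized to sum to 1,
    as would be convolved with a patch for Gaussian blur."""
    ax = np.arange(size) - (size - 1) / 2.0
    xx, yy = np.meshgrid(ax, ax)
    k = np.exp(-(xx ** 2 + yy ** 2) / (2.0 * sigma ** 2))
    return k / k.sum()
```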

Feature Selection
The data for a neural network form a multidimensional feature array, also known as a tensor, a container for the numerical data of images. All transformations learned by the neural network can be expressed as tensor operations on this numerical data. Spectral information (R, G, and B values) and elevation provide the theoretical basis for separating vegetation. The training data generated from the original point clouds carried RGB values and elevation, and the input data were normalized to eliminate the scale effect.
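Per-channel standardization is one common way to realize this normalization (a sketch under that assumption; the paper does not state the exact scheme):

```python
import numpy as np

def normalize_channels(x):
    """Scale each channel of an (H, W, C) sample to zero mean and unit
    variance, so RGB values (0-255) and elevation (metres) share a scale."""
    mean = x.mean(axis=(0, 1), keepdims=True)
    std = x.std(axis=(0, 1), keepdims=True)
    return (x - mean) / np.where(std == 0, 1.0, std)
```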

Design of the U-Net Network
An improved U-Net framework with a slightly altered structure was used for vegetation detection. The improved U-Net produced segmentation maps of the same size as the input data and preserved the continuity of the resolution.
The predictive model describes the relationship between input x (features) and desired output (answer) y. The system repeatedly 'learns' the relationship between data and output through derivatives and random deviations and obtains the values of a series of unknown parameters, thereby forming a set of rules on its own. These rules are applied to untrained data so that the model can predict the corresponding answers. This process is the core of the image segmentation task. With the FCN (Fully Convolutional Networks, [55]) architecture, a simple representation of the relationship between the predictive output and the input is as follows (Equation (3)):

ŷ = f_m( Σ_{j=1}^{n} w_j^{(m)} · f_{m−1}( ⋯ f_1( Σ_{i=1}^{n} w_i^{(1)} x_i ) ⋯ ) )   (3)

where x is the input; ŷ is the forecast output; m is the number of hidden layers, which determines the depth and, to an extent, the complexity of the network; n is the number of neurons in each layer, each neuron of the convolutional network being represented as a filter (nine neurons in this study); w is the weight assigned to a neuron connecting the input information for signaling; and f is an activation function for nonlinear mapping.
Three specific network architectures with different hyper-parameters were designed (Figure 2) for the vegetation detection task. In the down-sampling procedure, convolution was performed to extract features and activation values at different levels. Each convolution was based on the result of the previous convolution layer, bringing the model to a certain depth. The convolved feature values were reduced in dimension through pooling to cut a large amount of computational cost. The vegetation characteristics were thereby summarized and a wide range of features extracted, which made the data easier to learn and enhanced the model's learning ability. In the up-sampling procedure, the image size was expanded layer by layer to interpolate the feature maps at all levels. Details of the three models' hyper-parameters are shown in Table 3.
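The resolution bookkeeping of the down- and up-sampling paths can be illustrated without a DL framework (a NumPy sketch of 2 × 2 max pooling and nearest-neighbour up-sampling; the real network wraps learned convolutions around these steps):

```python
import numpy as np

def max_pool2x2(x):
    """2 x 2 max pooling: halves the spatial resolution (down-sampling)."""
    h, w = x.shape
    x = x[:h - h % 2, :w - w % 2]          # drop odd edge rows/columns
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

def upsample2x2(x):
    """Nearest-neighbour up-sampling: restores the spatial resolution so the
    output segmentation map matches the input size."""
    return x.repeat(2, axis=0).repeat(2, axis=1)
```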

Vegetation Detection Accuracy Assessment
The detection accuracy was assessed through a comparison with reference data that were manually interpreted from the original point cloud. A confusion matrix was used to calculate the accuracy of the rasterized results.
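For a binary vegetation/ground raster, this assessment reduces to a 2 × 2 confusion matrix and its overall accuracy (a sketch; the arrays below are illustrative, not the study's data):

```python
import numpy as np

def confusion_matrix(ref, pred):
    """2 x 2 confusion matrix for binary rasters (1 = vegetation, 0 = ground).
    cm[i, j] counts cells with reference class i and predicted class j."""
    cm = np.zeros((2, 2), dtype=int)
    for r, p in zip(ref.ravel(), pred.ravel()):
        cm[r, p] += 1
    return cm

def overall_accuracy(cm):
    """Fraction of cells on the diagonal (correctly classified)."""
    return np.trace(cm) / cm.sum()
```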

Terrain Correction
After vegetation detection, the terrain information could be corrected using the vegetation results. In terrain modeling, reasonably eliminating the vegetation points determines the accuracy of the DTM. In urban areas, a cross-section is usually used to completely eliminate the vegetation points and then interpolate complementary points to obtain the DTM [56]: the ground is fitted as a 2D terrain plane, and points higher than the plane are removed. However, this trend-surface approach often fails in mountainous terrain (e.g., the Loess Plateau), because the planes are difficult to estimate under dramatic relief. The alternative practice for mountainous areas is to uniformly lower the vegetation points based on an estimate of the average vegetation height [37]. This method is effective for continuous vegetation in mountainous areas and maintains the original terrain fluctuation, but it is restricted for discrete vegetation, owing to elevation fragmentation or convex terrain [57,58]. To solve this problem, this study divided the terrain correction into two scenarios, namely, discrete and continuous vegetation areas (Figure 3). Vegetation points in discrete areas were directly removed and then interpolated, and terrain correction was applied to the points in continuous areas.


Figure 3. Workflow of terrain correction.
Step 1: Identification of discrete and continuous vegetation areas. The vegetation detection result was first rasterized and then converted into polygons with the Raster to Polygon tool in ArcGIS Pro [54]. A threshold of 30 m², set by expertise, was used to separate discrete and continuous vegetation areas: polygons smaller than 30 m² were classified as discrete, and those larger than 30 m² as continuous.
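Once the polygon areas are available, the 30 m² split is a one-line rule (plain-Python sketch; the areas are illustrative, and we treat exactly 30 m² as continuous since the paper does not specify the boundary case):

```python
THRESHOLD_M2 = 30.0  # expert-set threshold from the paper

def classify_patch(area_m2):
    """Label one vegetation polygon by area (Step 1 of the workflow)."""
    return "discrete" if area_m2 < THRESHOLD_M2 else "continuous"

areas = [4.5, 12.0, 30.0, 85.2]          # illustrative polygon areas in m^2
labels = [classify_patch(a) for a in areas]
```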
Step 2: Point removal and spatial interpolation in discrete vegetation areas. The original point cloud obtained from UAV photogrammetry represents a surface model that includes the vegetation. To obtain a terrain model, these vegetation points must be excluded. Points in discrete vegetation areas could be eliminated directly; because the 'holes' left after removal were relatively small, they did not affect the overall terrain trend, and the terrain could then be interpolated.
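Any standard interpolator can fill the small holes; inverse-distance weighting is one simple choice (a NumPy sketch under that assumption, not necessarily the paper's interpolator):

```python
import numpy as np

def idw_fill(points, values, holes, power=2.0):
    """Inverse-distance-weighted elevation estimate at each hole location.
    points: (N, 2) ground xy; values: (N,) elevations; holes: (M, 2) xy."""
    out = np.empty(len(holes))
    for i, q in enumerate(holes):
        d = np.linalg.norm(points - q, axis=1)
        if np.any(d == 0):                 # query coincides with a known point
            out[i] = values[d == 0][0]
            continue
        w = 1.0 / d ** power
        out[i] = np.sum(w * values) / np.sum(w)
    return out
```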
Step 3: Terrain correction in continuous vegetation areas considering vegetation height. The commonly used local polynomial interpolation ignores the terrain's own fluctuations, so elevation information would be lost if the points in a continuous vegetation area were simply removed. A possible solution was to estimate the terrain elevation and then modify the elevation of the vegetation points in the point cloud. Because the height varies among individual continuous vegetation areas, an adaptive process with little human interaction was proposed. Ground elevation was estimated from the elevations in a 0.5 m buffer zone around each polygon, using the Zonal Statistics tool in ArcGIS Pro [54] on the original point cloud. The difference between the vegetation elevation and the buffer ground elevation of each polygon in the DSM was treated as the unified vegetation height of that area, and the final fine DTM was obtained by subtracting the estimated mean height of each polygon.
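The per-polygon correction then amounts to subtracting one mean height per continuous patch (a sketch; the elevations and array names are illustrative, not the study's data):

```python
import numpy as np

def correct_continuous_patch(canopy_z, buffer_ground_z):
    """Lower the DSM points of one continuous vegetation polygon by the
    patch's unified height: mean canopy elevation minus the mean ground
    elevation found in the 0.5 m buffer around the polygon."""
    mean_height = canopy_z.mean() - np.mean(buffer_ground_z)
    return canopy_z - mean_height

canopy = np.array([105.0, 105.5, 104.8])         # DSM elevations on vegetation
buffer_ground = np.array([103.0, 103.2, 102.8])  # bare ground in the buffer
terrain = correct_continuous_patch(canopy, buffer_ground)
```

Note that the within-patch relief of the canopy surface is preserved; only a constant offset per polygon is removed.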

Terrain Modeling Result Validation
Evaluating the generated DTM is the key to measuring the accuracy of the terrain modeling results. For validation, the final DTM was compared with the CPs obtained from the GNSS field survey. The Xining area was selected for the validation.
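The reported MSE is a direct comparison of DTM elevations against the surveyed CP elevations (a sketch; the sample values are illustrative):

```python
import numpy as np

def mse(dtm_z, cp_z):
    """Mean square error between DTM elevations sampled at the check points
    and the GNSS-surveyed check-point elevations."""
    dtm_z, cp_z = np.asarray(dtm_z, float), np.asarray(cp_z, float)
    return float(np.mean((dtm_z - cp_z) ** 2))
```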

Vegetation Detection Results
Xining was selected for model training. After a performance comparison of the designed U-Net network structures, U-Net model C was chosen for vegetation detection; the performance of the three structures is discussed in Section 4.1. After training, the model was applied to the two other study areas. Figure 4 shows the results for the three study areas.


Table 4 shows the confusion matrix of the vegetation detection results in the three areas against the reference. The detection accuracies were acceptable: 90.9% for Xining, 96.4% for Wangjiamao, and 87.2% for Wucheng. The vegetation detection in Wucheng was less accurate than in the other two areas because the tie points in its southwest corner were relatively insufficient during the automatic image matching, which reduced the accuracy of the original image matching point cloud.


Vegetation Identification Results
Identification was conducted in the three study areas, based on the adaptive treatment of discrete and continuous vegetation (Figure 5). The manmade vegetation spatial distribution patterns in Xining and the natural patterns in Wucheng and Wangjiamao were successfully identified.


Terrain Correction Results
After the vegetation identification, terrain correction was performed and DTMs with 1 m resolution were interpolated (Figure 6). The proposed method removed the vegetation points without losing the terrain details and restored a fine DTM. Compared with the orthophotos, the terrain reliefs were well presented in the modeling results. The smooth color rendering of the DTMs indicated that the vegetation removal was effective and that the DTM was visually refined.

Terrain Modeling Result Validation with Field Measurement Data
The check points surveyed in the field in Xining were evaluated to verify the DTM results (Figure 7).

Table 5 shows the elevation comparison at the CPs. The MSE was 0.024 m, which met the standard of accurate terrain modeling. Points D and H, which were originally ground points, had the highest prediction accuracy; correctly predicting the vegetation points ensured that the ground elevation values were preserved. Point G did not achieve an accurate elevation because it was located at a hole where the vegetation detection was incorrect. The terrain correction of the remaining vegetation points was guaranteed.


Discussion
In this section, additional analyses are presented to discuss the keys to the success of the vegetation detection. A hyper-parameter (network structure and epoch) influence analysis was first performed to obtain an optimized parameter setting. A comparison with two other published methods (perceptron and adaptive filtering) was then made for a deeper analysis of the performance of the proposed vegetation detection method.

U-Net Hyper-Parameter Influence on Vegetation Detection Performance
The performance of the three designed U-Net networks was assessed in terms of training loss, validation accuracy, and training accuracy, to understand the influence of parameters and architecture on vegetation detection. Figure 8a shows the training loss of the three models at different epoch settings. Model A is simple, with a small layer count and capacity; its training loss reached a local minimum at 48 epochs. The training loss of Model B bounced at the 16th and 38th epochs but overall declined faster than that of Model A. The training loss of Model C declined smoothly and reached a local minimum at the 45th epoch. Figure 8b shows the training accuracy of the three models. All three models generally showed an increasing trend. Model A did not reach saturation between the 8th and 40th epochs. Model B showed a decline in training accuracy at the 17th and 38th epochs. Model C achieved its local maximum accuracy at the 45th epoch. Figure 8c shows the validation accuracy of the three models. Model A had the lowest validation accuracy. Model B was moderately complex, with convolution occurring during pooling, and its validation accuracy was high; however, a substantial decline to 0.92 at the 15th epoch indicated slightly weakened stability. Model C was the most stable and accurate, reaching 0.94 at the 45th epoch (Figure 8c). Model C with an epoch of 45 was selected for vegetation detection, owing to its lowest loss function and highest accuracy during training and validation. When the network structure is large, the epoch count should be increased appropriately to ensure that the parameters are fully updated. Merging combined the convolution features and enhanced the up-sampling of the data.

Comparison of Vegetation Detection Performance with Other Methods
Two other methods, namely, perceptron [59] and adaptive filtering [60], were selected for an additional assessment of the vegetation detection. Precision, recall, and F-score values were used for validation. Precision indicates the extent to which the extraction results represent the real targets, i.e., the commission error of the model; recall represents the extent to which the real targets can be extracted and indicates the model's omissions. A true positive (TP) is a correctly extracted vegetation grid cell; a false positive (FP) is a ground sample predicted as vegetation; a true negative (TN) is a ground sample correctly predicted as ground; and a false negative (FN) is a vegetation sample predicted as ground, i.e., vegetation that was not extracted. Precision is the fraction of predicted positives that are truly positive, recall is the fraction of the original positive samples that are retrieved, and the F-value is the harmonic mean of precision and recall. The formulas are as follows (Equations (4)-(6)):

Precision = TP / (TP + FP)                                 (4)

Recall = TP / (TP + FN)                                    (5)

F = 2 · Precision · Recall / (Precision + Recall)          (6)

Figure 9 shows the precision, recall, and F-value of the three methods. Our improved U-Net architecture had the highest values in all three study areas; in particular, the best identification result was observed in Xining, with a precision of 0.91. The other two methods performed considerably worse than the improved U-Net. The perceptron lacked a hidden layer and did not introduce random deviation; its final classification was based on a hyperplane, which could not adapt to the complex terrain, resulting in low vegetation detection accuracy.
Adaptive filtering over-detected vegetation, and its results depended on the sketched vegetation range; it required the manual sketching of a training area as a supervised region for vegetation recognition in each study area.
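Equations (4)-(6) translate directly into code (a plain-Python sketch; the counts in the usage example are illustrative):

```python
def precision_recall_f(tp, fp, fn):
    """Precision, recall, and F-score (Equations (4)-(6)) from the counts of
    true positives, false positives, and false negatives."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f = 2 * precision * recall / (precision + recall)
    return precision, recall, f
```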

Conclusions
This study proposed a UAV photogrammetric framework for terrain modeling in dense vegetation areas. With the Loess Plateau of China as the study area, a DL and terrain correction ensemble method was proposed and applied. An improved U-Net network for vegetation segmentation was presented, using the feature combination of RGB + DSM for vegetation detection. According to four-fold cross-validation, the accuracy was 94.97%, and the model had good generalization ability. The influence of the U-Net architecture and the epoch setting on vegetation detection performance was also assessed, and comparison with other methods confirmed the better performance of the proposed technique. A fine DTM generation method for terrain modeling was also put forward: the vegetation area was divided into discrete and continuous parts, and adaptive terrain correction was proposed and realized. The DTM accuracy was evaluated against field measurements. This framework can be applied in dense vegetation, with the advantage of low-cost UAV photogrammetry where laser scanning is limited.
Author Contributions: Conceptualization, J.N. and K.X.; algorithm, J.N. and K.X.; classification analysis, J.N.; process the data, K.X.; writing-original draft preparation, J.N. and K.X.; writing-review and editing, H.D., J.S. and N.P.; supervision, L.X. and G.T.; funding acquisition, L.X. and G.T. All authors have read and agreed to the published version of the manuscript.