Deep Fusion of DOM and DSM Features for Benggang Discovery

Abstract: Benggang is a typical erosional landform in southern and southeastern China. Since benggang poses significant risks to local ecological environments and economic infrastructure, it is vital to accurately detect benggang-eroded areas. Relying only on remote sensing imagery for benggang detection cannot produce satisfactory results. In this study, we propose integrating high-resolution Digital Orthophoto Map (DOM) and Digital Surface Model (DSM) data for efficient and automatic benggang discovery. The fusion of the complementary information in DOM and DSM data is realized by a two-stream convolutional neural network (CNN), which integrates aggregated terrain features and activation image features, both extracted by supervised deep learning. We aggregate local low-level geomorphic features via a supervised diffusion-convolutional embedding branch for expressive representations of benggang terrain variations. Activation image features are obtained from an image-oriented convolutional neural network branch. The two sources of information (DOM and DSM) are fused via a gated neural network, which learns the most discriminative features for the detection of benggang. Evaluation on a challenging benggang dataset demonstrates that our method outperforms several baselines, even with limited training examples. The results show that the fusion of DOM and DSM data is beneficial for benggang detection via supervised convolutional and deep fusion networks.


Introduction
Benggang is a Chinese word for a typical gully erosional landform [1]. Roughly translated, benggang means "slope collapse" or "collapsing gully" in English. Benggang can be found in hilly areas covered by weathered granite crusts in southern and southeastern China. Similar to gullies, the development of benggang is caused by collective impacts of gravity and runoff water, involving complex processes of sediment collapsing and transport [2]. Apart from natural factors, anthropogenic activities that destroy vegetation cover also contribute to the development of benggang [3]. Typically, continuous benggang erosions at gully heads result in chair-like forms with fragmented landscapes. Many studies have investigated the geographical distributions, development mechanisms, and erosion patterns of benggang landscapes [2][3][4][5].
In 2015, the United Nations (UN) released the 2030 Agenda for Sustainable Development and introduced 17 global Sustainable Development Goals (SDGs). Goal 15 is about the protection of land ecosystems, aiming to promote environmental awareness and encourage ecological conservation across the world [6]. Within the framework of SDGs, the UN has defined the concept of Land Degradation Neutrality (LDN) and encouraged the international community to combat land degradation [7]. With a fast-developing erosion mechanism, benggang pose significant risks to local ecological environments and economic infrastructure, as they may destroy forests, fertile lands, roads, and human habitats [8]. In order to achieve LDN and related SDGs, necessary and immediate management and planning actions should be taken to reverse land degradation and restore benggang areas. Before we take appropriate preventive or control measures, it is vital to accurately detect benggang-eroded areas. Traditionally, the primary method to identify benggang is to conduct field surveys, which are costly in terms of resources and time. Recently, researchers have adopted various remote sensing technologies for benggang monitoring, including three-dimensional laser scanning [4] and Unmanned Aerial Vehicle (UAV) photogrammetry [9]. However, benggang are usually of small scales and covered with vegetation in the middle and late development stages, making it challenging to identify the boundary of benggang only based on remote sensing data. Without field investigation, they are very difficult to recognize from remote sensing images by manual interpretation. The current benggang investigation practices mostly start with manual identification of potential benggang areas from remote sensing imagery and are then followed by field surveys to localize benggang units. The entire workflow is time-consuming and error-prone, calling for a robust and automatic benggang discovery approach, especially for large areas. 
Still, since benggang are widely distributed and characterized by fast development, automatic and accurate detection of benggang areas remains a challenge.
Based on high-resolution remote sensing images, researchers have applied various machine learning methods to detect and monitor specific land deformation phenomena. Recent breakthroughs of deep learning in computer vision have offered many innovative methods and tools for remote sensing image understanding [10]. Among them, convolutional neural networks (CNNs) are the most widely used architecture for high-level image feature representation. Remarkable classification and detection performance has been achieved by fine-tuning pretrained CNNs [11], modifying CNN frameworks [12], defining novel objective functions [13], or constructing multiple network ensembles [14]. As powerful deep learning models in computer vision, CNNs have demonstrated their advantages in slope failure detection [15], landslide susceptibility evaluation [16], and landslide mapping [17,18]. It is also beneficial to integrate different machine learning methods for detecting land deformation phenomena. For example, using different earth observation data (satellite images and Digital Elevation Models), Piralilou et al. combined a multilayer perceptron neural network and random forest for landslide detection [19]. Ye et al. leveraged a deep belief network and logistic regression classifier to detect landslides using hyperspectral remote sensing images [20]. As these studies adopted loosely coupled models, we conjecture that integrating different data and models into an end-to-end learning framework may be beneficial for complex landform detection. Some studies have modified vanilla deep learning models to account for specific landform characteristics, such as an improved U-Net model for post-earthquake landslide extraction [21], a progressive CNN training scheme to promote generalization performance [22], and a cascaded deep learning model that accounts for landslide features from limited samples [23].
Compared with common natural or human-made objects, benggang is not a well-defined concept, with large intra-class appearance variations. Benggang comprises complex terrain landscapes without clear boundaries and distinct texture features. Directly applying deep learning detectors for benggang discovery may not achieve satisfactory performance. Therefore, we contend that an effective detection model should account for particular landscape characteristics of benggang.
Other sources of geospatial data such as high-resolution Digital Surface Model (DSM) data can provide complementary information to remote sensing image data. High-resolution DSM data contains rich information on terrain elevation capable of describing fine-grained characteristics of complex terrains and abrupt edge changes. Despite deep learning-based feature fusion being explored for remote sensing image understanding, most studies focus on visual feature fusion using different feature descriptors [24] or features extracted from multispectral images [25]. In this study, we propose integrating high-resolution Digital Orthophoto Map (DOM) and DSM data for efficient and automatic benggang detection with an integrated end-to-end learning model. We believe this fusion of multi-source monitoring data has the benefits of high detection precision, low cost, and robustness to landform variations, which are favorable for large-scale benggang investigation and studies on the mechanism of benggang erosion.
To the best of our knowledge, we are the first to discover benggang areas using deep learning-driven fusion based on DOM and DSM data. This study makes the following contributions: (1) We propose using a two-stream CNN framework to integrate aggregated terrain and image features for benggang discovery using high-resolution DOM and DSM data; (2) We develop a supervised, diffusive convolutional encoding scheme that aggregates local geomorphic features, yielding expressive terrain representations for benggang; (3) The developed deep fusion model is evaluated with a challenging benggang dataset. Supervised by limited training samples, our approach achieves satisfactory detection performance.
Similar erosional gully-like landforms can also be widely found in other countries, such as "lavaka" in Madagascar [26,27], "voçoroca" in Brazil [28], and "calanchi" in Italy [29,30]. Cost-effective monitoring of these gullies is critical for environmental protection in these countries. However, the current practices are largely limited to manual interpretation of remote sensing images and field surveys, which hinge on the domain knowledge of individual experts and the data quality of the images. Machine learning methods have been used to extract specific types of land deformation phenomena (e.g., landslides) based on remote sensing images [15][16][17][21], but their utility in detecting complex gully-like landforms is limited because they largely rely on visual features while ignoring terrain features that are specific to gully landforms. We believe the proposed detection approach can also be used in other areas of the Earth, helping local authorities and residents to better monitor and manage erosional gully landscapes.

Study Region and Data Description
The proposed approach was tested and evaluated with a DOM and a DSM dataset. The two datasets were produced from a set of aerial images, which were collected in 2018 over a hilly region of Deqing County, Guangdong Province, China. The study region has a subtropical monsoon climate, with a large solar altitude angle, strong radiation, high year-round temperature, and abundant rainfall, which provides sufficient external driving forces for the occurrence of benggang. Mountain soils are formed mainly by the weathering of granite that consists of crystals of quartz and feldspar. The weathered crust is loose and is prone to collapse under the influence of gravity. The original aerial images were acquired with three bands: blue, green, and red. The data were automatically preprocessed by INPHO, including aerial triangulation, image dense matching (for the DSM), and differential correction (for the DOM).
The DOM and DSM data have a spatial resolution of 0.2 and 0.5 m per pixel, respectively. The study region is partitioned by a regular grid with a resolution of 26 pixels, resulting in a cell size of 13 m × 13 m. We chose this resolution for the grid because it is well suited for providing fine-grained image and terrain information for detecting benggang areas, which have a minimum size of 50 m × 150 m. In both the training and test datasets, benggang areas were manually annotated by experts who have rich field experience in the study region. The labeled benggang areas were also validated by field observations. In field trips, we paid particular attention to areas that were covered by vegetation and difficult to interpret based on the DOM data. The data contain complex benggang landscapes that are well representative of the benggang detection problem (Figures 1 and 2).
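The grid arithmetic above can be checked with a short sketch. It assumes that the 26-pixel grid resolution is counted in DSM pixels (0.5 m each), which is the reading that reproduces the stated 13 m cell size; the function names are illustrative and not taken from the original implementation.

```python
# Resolutions and grid size as stated in the text.
DOM_RES_M = 0.2        # metres per DOM pixel
DSM_RES_M = 0.5        # metres per DSM pixel
CELL_DSM_PIXELS = 26   # grid resolution, assumed to be counted in DSM pixels


def cell_geometry(dsm_pixels=CELL_DSM_PIXELS, dsm_res=DSM_RES_M, dom_res=DOM_RES_M):
    """Return (cell edge length in metres, DOM pixels along one cell edge)."""
    cell_size_m = dsm_pixels * dsm_res
    dom_pixels = round(cell_size_m / dom_res)
    return cell_size_m, dom_pixels


def cells_spanned(extent_m, cell_size_m):
    """Number of whole grid cells that fit along an extent of the given length."""
    return int(extent_m // cell_size_m)


if __name__ == "__main__":
    size_m, dom_px = cell_geometry()
    print(f"cell edge: {size_m} m, covering {dom_px} DOM pixels per edge")
    print("smallest benggang (50 m x 150 m) spans about",
          cells_spanned(50.0, size_m), "x", cells_spanned(150.0, size_m), "cells")
```

Under these assumptions, even the smallest benggang area covers several cells in each direction, which is consistent with the stated motivation for the chosen grid resolution.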

Detection Approach
Upon the availability of semantically rich feature representations, the benggang discovery task can be cast as a classical object detection problem, which has been extensively researched over the past several decades [31]. The deep fusion-driven benggang discovery framework is shown in Figure 3. The study region was partitioned into regular grid cells, each of which was used as the basic unit for feature extraction and representation. First, we trained a CNN to extract abstract image features, supervised by detected benggang areas, using high-resolution DOM data. The activation of the last hidden layer of the CNN was used as the high-level DOM feature representation for the task. Meanwhile, DSM data were used to build a CNN-based high-level encoding scheme that aggregated local low-level geomorphic features. This encoding scheme relies on a diffusive convolutional neural network [32], helping construct high-level geomorphic descriptors, which are also supervised by detected benggang training samples. Upon the availability of the two types of high-level features, we used a two-stream CNN to integrate terrain descriptors and activation image features for benggang detection and localization. With a gated fusion network, both the DOM and DSM features were jointly embedded into a latent semantic space that had much better discriminative capabilities than either type of feature alone. Then, benggang areas could be discovered by a classification model, such as a fully connected network using the binary cross entropy loss function.

Extracting High-Level DOM Features
In computer vision tasks, it has been shown that CNN models trained with a huge amount of data are able to extract deep visual features. Therefore, the VGG network [33] trained with the ImageNet dataset was used to extract representative high-level DOM features in our approach. We used the VGG network to derive 512-dimensional activation feature vectors for DOM images.

Constructing Aggregated DSM Features
Terrain features are critical for benggang recognition and analysis. However, raw terrain features are inadequate for complex scene interpretation. In this study, we propose constructing aggregated DSM features based on a diffusive convolutional neural network [32], which is trained on labeled benggang data. The diffusive convolutional neural network has the benefit of extracting semantically meaningful high-level terrain representations. A diffusive convolution was defined to simulate the process of benggang erosion. We considered each grid cell as a graph node. Given a graph G with N nodes, a transition tensor $Tr \in \mathbb{R}^{N \times H \times N}$ can be built that encodes the probability of moving from one node to another within H hops. G can be described by a terrain feature matrix $X \in \mathbb{R}^{N \times F}$, where F is the feature dimensionality. Our task is then to encode informative terrain features with diffusive convolutional embeddings for all nodes. For node i during the tth hop, the output representation can be written as

$$h_i^{(t)} = f\left(W^{(t)} \odot Tr_i^{(t)} X\right),$$

where f is a nonlinear activation function, $W^{(t)}$ is a trainable weight vector, $Tr_i^{(t)} \in \mathbb{R}^{1 \times N}$ is the row for node i of the transition matrix $Tr^{(t)} \in \mathbb{R}^{N \times N}$ for the tth hop, and $\odot$ denotes the Hadamard product. To enable the computation of $h_i^{(t)}$, we needed to construct aggregated terrain feature vectors X for all graph nodes and derive the transition tensor Tr.
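As an illustration of the diffusive convolution, the following minimal sketch assumes the node-level update takes the form used by diffusion-convolutional networks [32]: the features of all nodes are first diffused with the hop-t transition row of node i, then scaled element-wise by a hop-specific weight vector, and finally passed through an activation (ReLU here). The function and argument names are hypothetical, and plain Python lists stand in for the tensors of the actual PyTorch implementation.

```python
def diffusion_embedding(transitions, features, weights):
    """Diffusion-convolution sketch.

    transitions[i][t][j]: transition value from node i to node j at hop t (N x H x N)
    features[j][k]:       terrain feature k of node j                      (N x F)
    weights[t][k]:        trainable weight for hop t and feature k         (H x F)
    Returns h with h[i][t][k] = relu(weights[t][k] * sum_j transitions[i][t][j] * features[j][k]).
    """
    n_nodes = len(features)
    n_feat = len(features[0])
    n_hops = len(weights)
    embedding = []
    for i in range(n_nodes):
        per_hop = []
        for t in range(n_hops):
            # Diffuse the features of all nodes towards node i at hop t.
            diffused = [
                sum(transitions[i][t][j] * features[j][k] for j in range(n_nodes))
                for k in range(n_feat)
            ]
            # Element-wise (Hadamard) weighting followed by ReLU.
            per_hop.append([max(0.0, weights[t][k] * diffused[k]) for k in range(n_feat)])
        embedding.append(per_hop)
    return embedding


def flatten(per_hop):
    """Reshape the H x F embedding of one node into a one-dimensional vector."""
    return [value for hop in per_hop for value in hop]


if __name__ == "__main__":
    Tr = [[[0.0, 1.0]], [[0.5, 0.5]]]      # 2 nodes, 1 hop
    X = [[1.0, 2.0], [3.0, -4.0]]          # 2 nodes, 2 features
    W = [[1.0, 1.0]]                       # 1 hop, 2 features
    h = diffusion_embedding(Tr, X, W)
    print(h[0][0], h[1][0])
```

In this toy example node 0 simply inherits the features of node 1, while node 1 averages both nodes; negative responses are clipped by the activation.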
Based on DSM data, we could extract multi-dimensional terrain feature vectors at the granularity of the grid cells. For each node, a 75-dimensional vector was constructed by concatenating the following features: (1) Average elevation over all pixels in a grid cell; (2) Average elevation slope over all pixels in a grid cell; (3) Maximum elevation difference between pixels; (4) Maximum slope difference between pixels; (5) Average gradient orientations; (6) Maximum elevation from the centroid to four corner points and four edge mid-points; (7) Average elevations over pixels with the same horizontal coordinates (26 dimensions); (8) Average slope over pixels with the same horizontal coordinates (26 dimensions); (9) A 16-dimensional vector that encodes gradient statistics based on the gradient magnitudes and orientations of all pixels. For each pixel, its gradient is weighted by the inverse of the distance between the pixel and the centroid. The 360-degree range of orientation is equally divided into 16 bins. The weighted gradients are accumulated into these 16 orientation bins according to their gradient orientations. After obtaining all the 16 elements, we reset the maximum accumulated gradient as the first element and arranged the rest of the accumulated gradients in clockwise order (16 dimensions); (10) The normal orientation, which is recorded as the serial number of bin (0-15) that has the maximum accumulated gradients.
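Features (9) and (10) can be sketched as follows. This is a minimal illustration under several assumptions that are not stated in the text: each pixel is supplied as a tuple of its gradient components and its distance to the cell centroid, orientations are measured with `atan2`, and "resetting the maximum as the first element" is implemented as a circular shift of the histogram (whether the resulting order is clockwise depends on the axis convention of the raster). The dimension tally treats feature (6) as a single value, which is the reading that yields the stated total of 75.

```python
import math

N_BINS = 16

# Dimensions of the ten feature groups; (6) is assumed to contribute one value.
FEATURE_DIMS = {
    "mean_elevation": 1, "mean_slope": 1, "max_elevation_diff": 1,
    "max_slope_diff": 1, "mean_gradient_orientation": 1,
    "max_elevation_centroid_to_border": 1,
    "row_mean_elevation": 26, "row_mean_slope": 26,
    "orientation_histogram": 16, "normal_orientation": 1,
}


def orientation_histogram(pixels):
    """Accumulate inverse-distance-weighted gradient magnitudes into 16 bins.

    pixels: iterable of (gx, gy, dist), where (gx, gy) is the elevation gradient
    and dist is the distance between the pixel and the cell centroid.
    """
    hist = [0.0] * N_BINS
    bin_width = 2.0 * math.pi / N_BINS
    for gx, gy, dist in pixels:
        magnitude = math.hypot(gx, gy)
        if magnitude == 0.0 or dist <= 0.0:
            continue  # flat pixels and a pixel at the centroid carry no vote
        angle = math.atan2(gy, gx) % (2.0 * math.pi)
        bin_index = int(angle / bin_width) % N_BINS
        hist[bin_index] += magnitude / dist
    return hist


def rotate_to_dominant(hist):
    """Shift the histogram so the largest bin comes first; also return its index."""
    dominant = max(range(len(hist)), key=hist.__getitem__)
    return hist[dominant:] + hist[:dominant], dominant


if __name__ == "__main__":
    hist = orientation_histogram([(1.0, 0.0, 1.0), (0.0, 2.0, 1.0), (0.0, 1.0, 2.0)])
    rotated, normal_orientation = rotate_to_dominant(hist)
    print(normal_orientation, rotated[:3], sum(FEATURE_DIMS.values()))
```

The returned index of the dominant bin corresponds to the normal orientation of feature (10), and the rotated histogram makes feature (9) independent of the absolute facing direction of the slope.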
The inter-node transition tensor can be computed as follows: (1) Compute the transition distances between the centroid of each node and the centroids of its eight nearest neighboring nodes (queen-based neighbors) (Figure 4). The transition distance between nodes o and o' can be calculated as

$$d_{o \to o'} = \sqrt{\Delta h^{2} + \left(d^{p}_{o \to o'}\right)^{2}},$$

where $\Delta h$ is the difference of the average elevation between centroids o and o' (i.e., $h_o - h_{o'}$) and $d^{p}_{o \to o'}$ is the projected distance between the two centroids; (2) The transition distances are labeled as positive or negative, depending on whether the destination node has a higher average elevation than the origin node. Positive (negative) distances indicate that the origin node is higher (lower) than the destination node; (3) Signed distances are further weighted according to the angle α between the transition link and the normal orientation. The weights are inversely proportional to the range of the angle; (4) The inter-node transition probabilities of the first hop $T^{(1)}_{o \to o'}$ are calculated as the inverse of the signed transition distance:

$$T^{(1)}_{o \to o'} = p_{oo'} \cdot \frac{\mathrm{Sign}(\Delta h)}{d_{o \to o'}},$$

where $p_{oo'}$ is a weight to measure the effect of the angle α on the transition probability and $\mathrm{Sign}(\Delta h)$ returns 1 if $\Delta h > 0$; otherwise, it returns −1.
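Steps (1), (2), and (4) can be illustrated with a small sketch. It assumes that the transition distance is the three-dimensional Euclidean distance combining the elevation difference and the projected centroid distance, and that the angle weight of step (3) is supplied by the caller, since its exact form is not given in the text. All function and parameter names are illustrative.

```python
import math


def transition_distance(h_origin, h_dest, projected_dist):
    """Distance between two cell centroids, combining elevation drop and planar offset."""
    delta_h = h_origin - h_dest
    return math.hypot(delta_h, projected_dist)


def sign(delta_h):
    """Return 1 if the origin is higher than the destination, otherwise -1."""
    return 1 if delta_h > 0 else -1


def first_hop_transition(h_origin, h_dest, projected_dist, angle_weight=1.0):
    """First-hop transition value: angle weight times the inverse signed distance."""
    delta_h = h_origin - h_dest
    return angle_weight * sign(delta_h) / transition_distance(h_origin, h_dest, projected_dist)


if __name__ == "__main__":
    # A cell at 105 m draining towards a neighbour at 100 m, 12 m away in plan view.
    print(transition_distance(105.0, 100.0, 12.0))
    print(first_hop_transition(105.0, 100.0, 12.0))   # downhill: positive
    print(first_hop_transition(100.0, 105.0, 12.0))   # uphill: negative
```

With a 13 m grid, the projected distance would be 13 m for edge-adjacent neighbors and about 18.4 m for diagonal neighbors, so steep drops towards nearby cells receive the largest positive values.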

Fusing DOM and DSM Features
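Following the gated multimodal unit (GMU) model [34], we integrated the extracted high-level image and terrain features in a supervised learning scheme. Linear transformations were applied to the two feature tensors, resulting in two vectors with the same dimension for each node. The fusion was performed by a gated unit that combined information from the two modalities. For each node, the resultant fusion vector $h_i^{f}$ is regulated by a gate z:

$$h_i^{f} = z \odot h_i^{DOM} + (1 - z) \odot h_i^{DSM}, \qquad z = \sigma\left(W_z \left[x_i^{DOM}, x_i^{DSM}\right]\right),$$

where $x_i^{DOM}$ and $x_i^{DSM}$ are the extracted DOM and DSM features of node i, $h_i^{DOM}$ and $h_i^{DSM}$ are their linearly transformed counterparts, σ is the sigmoid function, [,] is a vector concatenation operation, and $W_z$ is the trainable gate weight, initialized from a uniform distribution [35]. The diffusive convolutional outputs $h_i^{(t)}$ need to be reshaped into one-dimensional vectors before being used for fusion.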
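The gating mechanism can be sketched in a few lines. The sketch follows the general GMU idea of [34] under assumptions that are not confirmed by the text: the gate is computed from the concatenated input features, it weights the DOM branch with z and the DSM branch with 1 − z, and the projections are purely linear. The weight matrices are passed in as nested lists, and every name is hypothetical.

```python
import math


def linear(weight, vector):
    """Matrix-vector product for a weight given as a list of rows."""
    return [sum(w * v for w, v in zip(row, vector)) for row in weight]


def sigmoid(value):
    return 1.0 / (1.0 + math.exp(-value))


def gated_fusion(x_dom, x_dsm, w_dom, w_dsm, w_gate):
    """Fuse DOM and DSM features of one node with an element-wise gate.

    Returns (fused vector, gate activations); a gate value near 1 means the
    DOM branch dominates that dimension, a value near 0 the DSM branch.
    """
    h_dom = linear(w_dom, x_dom)             # project both modalities
    h_dsm = linear(w_dsm, x_dsm)             # to a common dimension
    gate = [sigmoid(a) for a in linear(w_gate, x_dom + x_dsm)]  # list concatenation
    fused = [z * d + (1.0 - z) * s for z, d, s in zip(gate, h_dom, h_dsm)]
    return fused, gate


if __name__ == "__main__":
    x_dom, x_dsm = [1.0, 2.0], [3.0]
    w_dom = [[1.0, 0.0], [0.0, 1.0]]
    w_dsm = [[1.0], [2.0]]
    w_gate = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]   # zero gate weights give z = 0.5
    print(gated_fusion(x_dom, x_dsm, w_dom, w_dsm, w_gate))
```

Averaging the gate activations of a node gives a single score indicating which modality contributed more to its fused representation.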
A fully connected layer is used as the classification model to supervise the fusion training, using the binary cross entropy loss function:

$$L = -\sum_{i} \left[ y_i \log \sigma\left(\theta^{\top} h_i^{f}\right) + (1 - y_i) \log\left(1 - \sigma\left(\theta^{\top} h_i^{f}\right)\right) \right],$$

where $y_i$ is a binary classification label (benggang or non-benggang), σ is the sigmoid function, and θ is a learnable vector. During testing, the trained image and terrain feature extraction methods were applied to the gridded DOM and DSM data, respectively. The trained GMU model was used to produce fusion node vectors, which were fed into a binary classifier (e.g., a fully connected neural network) to obtain the benggang detection results.
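For concreteness, the binary cross entropy over a batch of fused node vectors can be sketched as below. The sketch assumes a single logistic output with a learnable vector θ and averages over nodes rather than summing, which only rescales the loss; the clipping constant and the function name are illustrative choices.

```python
import math


def bce_loss(fused_vectors, labels, theta, eps=1e-12):
    """Mean binary cross entropy of a logistic classifier on fused node vectors.

    fused_vectors: list of fusion vectors, one per node
    labels:        list of 0/1 labels (1 = benggang, 0 = non-benggang)
    theta:         learnable weight vector of the classifier
    """
    total = 0.0
    for vector, label in zip(fused_vectors, labels):
        logit = sum(t * v for t, v in zip(theta, vector))
        prob = 1.0 / (1.0 + math.exp(-logit))
        prob = min(max(prob, eps), 1.0 - eps)   # avoid log(0)
        total -= label * math.log(prob) + (1 - label) * math.log(1.0 - prob)
    return total / len(labels)


if __name__ == "__main__":
    vectors = [[2.0, 4.0], [-1.0, 0.5]]
    print(bce_loss(vectors, [1, 0], theta=[0.0, 0.0]))   # uninformative classifier
    print(bce_loss(vectors, [1, 0], theta=[1.0, 0.0]))   # separates the two nodes
```

An uninformative classifier (θ = 0) yields a loss of ln 2 per node, and the loss decreases as θ learns to separate benggang from non-benggang cells.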

Implementation Details
To extract the DOM features, before being fed into the DOM stream convolutional network, all training and test images were cropped and scaled to patches of 224 × 224 pixels by maintaining the original aspect ratio. The DOM stream is based on the VGG network [33]. The VGG network comprises 13 convolutional layers (with 3 × 3 convolutional filters), 5 max-pooling layers (with a kernel size of 2 × 2 pixels), and 3 fully connected layers. The learning rate was set to 0.0001.
For constructing aggregated DSM features, the diffusive convolutional neural network consisted of a diffusive convolutional activation layer and a fully connected layer, and the activation functions for the two layers were ReLU and Softmax, respectively. The learning rate was set to 0.05.
As for feature fusion, the fusion training ran for at least 10 epochs and was considered converged once the loss remained under 0.01. Using the Adam optimizer [36], the model was trained with a batch size of 32. Before being used for training, the nodes were completely reshuffled. The learning rate was decayed by a factor of 0.1 every 5 epochs.
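The training schedule can be summarized with a short sketch. The initial learning rate of the fusion stage is not stated, so the value used in the example is purely hypothetical; the stopping rule interprets "remained under 0.01" as the most recent losses all lying below the threshold, with the number of losses checked (`patience`) being an assumed parameter. In PyTorch, the same decay corresponds to `torch.optim.lr_scheduler.StepLR` with `step_size=5` and `gamma=0.1`.

```python
def stepped_lr(initial_lr, epoch, step_size=5, gamma=0.1):
    """Learning rate after multiplying by `gamma` once every `step_size` epochs."""
    return initial_lr * gamma ** (epoch // step_size)


def has_converged(loss_history, min_epochs=10, threshold=0.01, patience=1):
    """True once at least `min_epochs` ran and the last `patience` losses are below threshold."""
    if len(loss_history) < min_epochs:
        return False
    return all(loss < threshold for loss in loss_history[-patience:])


if __name__ == "__main__":
    hypothetical_initial_lr = 0.01   # assumption: not reported for the fusion stage
    for epoch in (0, 4, 5, 10):
        print(epoch, stepped_lr(hypothetical_initial_lr, epoch))
    print(has_converged([0.5] * 9 + [0.005]))
```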

Experimental Setting
We conducted three experiments to evaluate the proposed benggang detection approach on two datasets, each of which contained both DOM and DSM data for five samples of rectangular areas (Figure 1). The summaries of the two datasets are given in Table 1. The configurations and results of the three experiments are presented in the following. All experiments were conducted on a desktop machine with an Intel® i7-8700K (3.7 GHz) CPU and an NVIDIA GeForce RTX 2080Ti GPU. The entire feature extraction and fusion method was implemented using PyTorch on a Microsoft Windows 10 operating system.
In the first test, we evaluated the proposed approach by a fivefold cross validation scheme on five continuous benggang areas from the first dataset. Each area consisted of 432 (i.e., 24 × 18) cells. For each run, one area was used as a test set and evaluated by the trained model using data from the other four areas. The average performance results over five runs are reported in Table 2. We used the precision, recall, and F1-score as the performance metrics to compare the proposed approach against the following baselines: (1) VGG-DOM: a classification model based on the VGG network [33] using only DOM data. VGG16 is a widely used deep convolutional neural network with 13 convolutional layers and small-sized (3 × 3) convolution filters. The DSM data were not used in this baseline, and no data fusion was performed; (2) DCNN-DSM: a diffusive convolutional neural network (DCNN) [32] using only DSM data. Supervised by the labeled data, the DCNN model can learn integrated representations via diffusive convolutions that leverage both local attribute and graph structure information. Similar to VGG-DOM, only one type of data was used, and no data fusion was performed; (3) SimpleDSM: a variant of the proposed method using raw terrain features (without using aggregated terrain features that are learned by the diffusive convolutional neural network). Therefore, only the DOM convolutional network is used in the original two-stream CNN model; (4) Concat-Fusion: a variant of the proposed method using a simple fusion method that is based on feature concatenation. The two-stream CNN architecture was used, but the gated feature fusion was replaced with simple concatenation; (5) Linear-Fusion: a variant of the proposed method using another simple fusion method that is based on linear summation of the DOM and DSM features. Equal weights are used for the summation of the two modalities. 
In other words, linear feature summation was used as the fusion method rather than the gated feature fusion in the full model. The precision was computed as the ratio between the number of correctly detected benggang cells and the total number of cells classified as benggang. The recall was computed as the ratio between the number of correctly detected benggang cells and the total number of benggang cells. The F1-score is the harmonic mean of the precision and recall. Table 2 shows that the proposed deep fusion-based approach achieved better detection performance than the compared baselines. The variations of our approach over different test examples were also relatively small. The empirical improvements over VGG-DOM and DCNN-DSM could be attributed to the fusion of both DOM and DSM information. The performance gain of the proposed approach over SimpleDSM indicates the advantage of using diffusive convolutional neural networks in training aggregated terrain features. The use of the gated fusion model in the proposed approach was beneficial, as demonstrated by the improvements of the performance metrics over the other two fusion methods (i.e., Concat-Fusion and Linear-Fusion).
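The three metrics can be computed per grid cell as in the following sketch, where cell labels are encoded as 1 (benggang) and 0 (non-benggang). The function name and the convention of returning 0 when a denominator is empty are illustrative choices rather than details taken from the study.

```python
def detection_scores(predicted, actual):
    """Cell-level precision, recall, and F1-score for binary benggang labels."""
    true_pos = sum(1 for p, a in zip(predicted, actual) if p == 1 and a == 1)
    false_pos = sum(1 for p, a in zip(predicted, actual) if p == 1 and a == 0)
    false_neg = sum(1 for p, a in zip(predicted, actual) if p == 0 and a == 1)

    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    predicted = [1, 1, 0, 0]
    actual = [1, 0, 1, 1]
    print(detection_scores(predicted, actual))
```

In the example, one of two predicted benggang cells is correct (precision 0.5) and one of three true benggang cells is found (recall 1/3), giving an F1-score of 0.4.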

Comparison with Baselines
The second experiment compared the performance of the proposed approach with the other baselines on another five examples in the second dataset (see Figure 1). The five tested examples contained three benggang areas and two non-benggang areas. The tested model was trained using the first dataset. Table 3 indicates that our approach was superior to the other baselines over the three performance metrics. The other baselines incorrectly classified some grid cells as benggang in the two non-benggang areas, but our approach could avoid these mistakes.
In the last experiment, we used different numbers of training samples from the first dataset and tested over the rest of the samples in the first and second datasets, with the goal of evaluating the generalization capabilities of the proposed approach. Table 4 shows that the performance gains of our approach over the other two baselines were more prominent when using a small amount of training data, implying that our approach could generalize well using limited training samples.
Figure 6 presents the detection results of some examples, showing that the proposed approach could distinguish contiguous benggang areas from complex backgrounds. The proposed approach was also robust for complex non-benggang backgrounds. The two areas in Figure 6 contain a mixed set of different landscapes, including forests, roads, and farmland. The proposed approach was able to distinguish them from benggang areas. The VGG-DOM method tends to produce false positive results, since it is not able to distinguish non-benggang areas with similar texture patterns to benggang areas. DCNN-DSM performed the worst, indicating that relying merely on terrain features is not robust and that terrain features should be integrated with image features. Figure 6b shows that the Concat-Fusion method frequently labeled non-benggang areas as benggang since it treated the DOM and DSM equally and may not have chosen the most discriminative local features.

Parameter Selection
We investigated the impacts of one parameter on the detection performance: the number of diffusive hops when constructing aggregated terrain features, following the same setting as the first experiment. Table 5 shows that when h = 3, the model achieved the best and most stable performance. We attribute this optimal selection to the sizes of the benggang in the studied region. The sizes of the benggang areas ranged from 50 m × 150 m to 150 m × 350 m, meaning that three hops (i.e., 26~55 m in transition distance) were suitable for capturing the change patterns of the elevation variations across the benggang boundaries or within the benggang areas. The cell size and hop number could be adjusted when given different image resolutions and benggang sizes.

Computational Efficiency
We compared the training and testing time costs of the proposed method and three baselines for the second experiment. Table 6 shows that our approach had similar time costs to VGG-DOM and Concat-Fusion. The DCNN-DSM model ran much faster at both the training and testing stages because it only handled DSM data. The time costs were practically acceptable for the benggang detection task.

Discussion
Since benggang areas are surrounded by similar landforms in mountainous southern and southeastern China, it is challenging to detect them by manual inspection or relying on one single source of earth observational data. If benggang areas are covered with vegetation, DOM data may not provide sufficient texture information for benggang detection. Without other sources of information, bare lands or farm lands after harvest would confuse the DOM-based classifier. On the other hand, the development of a benggang is driven by consistent erosion on its gully head, causing significant elevation variations along its boundary. The gully bottom and deposition area have relatively mild elevation changes. Therefore, DSM data can be of help in benggang detection. However, since the spatial resolution of DSM data is usually much lower than that of DOM data, using only DSM data may not produce satisfactory results, as indicated by our tests. The integration of DOM and DSM data thus allowed us to examine three-dimensional landscape models with high-resolution texture, which provided much richer feature information than either DOM or DSM data. We proposed integrating DOM and DSM data under a deep gated fusion framework, taking advantage of the most effective discriminative capabilities of image and terrain features.
To compensate for the coarse resolution of the DSM data, we used diffusive convolutions to extract aggregated, meaningful terrain features that were able to preserve the variation patterns of elevation for benggang areas. We note that the benggang boundaries have distinct feature vectors from non-benggang areas because they are characterized by significant elevation variations. The diffusive convolutional features thus can capture such variations to facilitate the discovery of benggang boundaries. We used t-SNE [37] to visualize the feature embeddings of the compared detection approaches in 2D space. Figure 7 shows that the proposed deep fusion approach could learn two embedding clusters that could be easily separated, whereas other baselines failed to distinguish benggang and non-benggang cells since the learned embeddings overlapped significantly. Being totally data-driven, the gated fusion mechanism facilitates the interpretation of the most informative integrated features based on DOM and DSM features. According to Equation (5), the gate activations z regulate the influences of DOM and DSM data on benggang detection. Thus, we could use the averages of z to see which data modality had greater effects on the test results. Figure 8 shows the quantitative scores that describe which data modality was more influential for each detected cell for two areas. The grid cells with a blue (red) color show that DSM (DOM) data played a more important role in the fusion model. According to the two samples, we can see that the terrain features were more useful in detecting benggang areas or identifying non-benggang areas if they had similar image features to those of benggang areas (e.g., farmlands to the right side of Figure 8a). Image features are more helpful when we try to distinguish non-benggang areas from benggang areas if these areas present distinct texture features (e.g., roads in Figure 8b).

Conclusions
This study explores the possibility of combining DOM and DSM data for detecting benggang, a common erosional landform in southern and southeastern China. Diffusive convolutional neural networks are used to extract representative terrain features, which are then integrated with CNN-derived image features to label benggang landscapes. We have demonstrated that the proposed detection approach achieved performance superior to several baselines, showing that the fusion of DOM and DSM data is beneficial for benggang detection via supervised convolutional and deep fusion networks. Future work will focus on the detection of different development stages of benggang and the evaluation of erosion risk for the surrounding environments. We also plan to collect DOM and DSM data from other areas in southern China and perform extensive evaluations on the proposed fusion-based detection approach.