Generalized Sparse Convolutional Neural Networks for Semantic Segmentation of Point Clouds Derived from Tri-Stereo Satellite Imagery

We studied the applicability of point clouds derived from tri-stereo satellite imagery for semantic segmentation for generalized sparse convolutional neural networks by the example of an Austrian study area. We examined, in particular, if the distorted geometric information, in addition to color, influences the performance of segmenting clutter, roads, buildings, trees, and vehicles. In this regard, we trained a fully convolutional neural network that uses generalized sparse convolution one time solely on 3D geometric information (i.e., 3D point cloud derived by dense image matching), and twice on 3D geometric as well as color information. In the first experiment, we did not use class weights, whereas in the second we did. We compared the results with a fully convolutional neural network that was trained on a 2D orthophoto, and a decision tree that was once trained on hand-crafted 3D geometric features, and once trained on hand-crafted 3D geometric as well as color features. The decision tree using hand-crafted features has been successfully applied to aerial laser scanning data in the literature. Hence, we compared our main interest of study, a representation learning technique, with another representation learning technique, and a non-representation learning technique. Our study area is located in Waldviertel, a region in Lower Austria. The territory is a hilly region covered mainly by forests, agriculture, and grasslands. Our classes of interest are heavily unbalanced. However, we did not use any data augmentation techniques to counter overfitting. For our study area, we reported that geometric and color information only improves the performance of the Generalized Sparse Convolutional Neural Network (GSCNN) on the dominant class, which leads to a higher overall performance in our case. We also found that training the network with median class weighting partially reverts the effects of adding color. The network also started to learn the classes with lower occurrences. The fully convolutional neural network that was trained on the 2D orthophoto generally outperforms the other two with a kappa score of over 90% and an average per class accuracy of 61%. However, the decision tree trained on colors and hand-crafted geometric features has a 2% higher accuracy for roads. Remote Sens. 2020, 12, 1289; doi:10.3390/rs12081289 www.mdpi.com/journal/remotesensing Remote Sens. 2020, 12, 1289 2 of 36

Due to this success, DL also gained popularity in the fields of photogrammetry and remote sensing [57][58][59], as it plays a key role in a variety of applications. Such applications include, inter alia, classification and segmentation [60][61][62], urban construction and planning [63][64][65][66], agriculture [67][68][69][70], the intersection of archaeology and remote sensing [71], predicting poverty [72][73][74], natural disaster and crisis management, and infrastructure management [75,76]. However, little research has thus far explored the utility of point clouds derived from tri-stereo satellite imagery for DL. One reason for this may be the fact that cross-correlation, the way the convolution operation is implemented in DL libraries, is not directly applicable to the three-dimensional space. In addition, adding geometric information from tri-stereo satellite imagery is a challenging machine learning problem, because the 3D geometry quality is low The poor geometric precision is caused by noise and low radiometric resolution of satellite imagery, as well as smoothing effects of adopted dense image matching algorithms. Nevertheless, this extra geometric information can be a valuable source of features for machine learning algorithms.
Scene understanding based on point clouds derived from tri-stereo satellite imagery is potentially a valuable information source for businesses, as it can enable business processes which were not possible before, or bring efficiency and effectiveness gains to existing ones. For example, businesses can conduct measurement tasks for a construction site where aerial laser scans are not possible, e.g., due to restrictions in flight permission or time constraints. Moreover, businesses can use it for surveillance tasks, e.g., nearby oil pipelines. The military can use this information to protect embassies in the case of an emergency or in the case of a natural disaster. Consequently, point clouds derived form tri-stereo satellite imagery can bring value for companies and public institutions alike, which can incorporate it in the life cycles of their business processes [77], in, e.g., process monitoring [78,79], process compliance [80], process conformance checking [81,82], process mining [83][84][85][86], and performance mining [87][88][89].
In this paper, we study the applicability of point clouds derived from tri-stereo satellite imagery for GSCNNs. To the best of our knowledge, this is the first work that applies DL to derived point clouds from tri-stereo satellite imagery. We learn directly on geometric information without the need to transform it into a format that is suitable for the standard convolution operation. In particular, we train a Fully Convolutional Neural Network (FCNN) [90] that uses generalized sparse convolution as introduced by Choy et al. [91] to distinguish the classes C road, building, vegetation, vehicle, and clutter, i.e., C = {clutter, road, building, tree, vehicle}, where clutter refers to points that do not belong to any of the other four categories. The network architecture follows the U-Net structure from Ronneberger et al. with an encoder followed by a decoder that is connected via skip connections [92]. In addition, the model incorporates residual learning [93], Rectifier Linear Units (ReLUs) [94,95], and batch normalization [96,97]. We compare the performance of this model to FCN-8s [90] that is trained on an orthophoto and to a decision tree that uses hand-crafted geometric features [98] and color. Hence, we compare the model to a representation learning technique, and once to a non-representation learning technique. In doing so, we report the average precision [99], average recall [99], average F1 score [99,100], average accuracy, overall accuracy [101], kappa score [102], Matthew's Correlation Coefficient (MCC) [103,104], and confusion matrices. We use the Minkowski Engine [91], which is built on top of PyTorch [105], for training the GSCNN. For FCN-8s, we use Caffe [106]. We use R to train the decision tree [107], and the software package OPALS (https://opals.geo.tuwien.ac.at) for computing the hand-crafted geometric features [108]. All trained GSCNNs are made public on PyTorch Hub after each epoch to facilitate research and ease reproducibility on PyTorch Hub. We do the same for the decision trees. For further information, please visit the GitHub project (https://github.com/MacOS/ReKlaSat-3D/) for this paper. A general overview of the workflow is depicted in Figure 1. Overview of the general workflow. We use tri-stereo Pléiades images of the study area, Waldviertel. The images are then processed to derive the 3D point cloud and the corresponding orthophoto. Whereas the Convolutional Neural Network (CNN) approach utilizes the orthophoto, the GSCNN semantic segmentation uses the 3D point cloud as input. For the decision tree, we compute hand-crafted geometric features while we leave the color features as they are. Finally, we compare the GSCNN results with the CNN approach and the decision tree, i.e., a strong baseline (Section 4.2).
The remainder of this paper is structured as follows. In Section 2, we present the background and related work. This includes 3D reconstruction, DL on satellite imagery, the difficulty for DL of learning directly on point clouds, and a discussion about reproducibility and choosing baselines in machine learning. In Section 3, we describe the study site and our dataset, the photogrammetric workflow, our DL architecture and training procedure for the point clouds, and our steps towards a more scientific approach in machine learning. We present our results in Section 4. In Section 5, we discuss our results in the light of generalizing the convolution operation from the two-dimensional grid world to higher dimensions, and we end this section with future research directions. Finally, we conclude and summarize this paper in Section 6.

Background and Related Work
Section 2.1 describes the photogrammetric workflow of point cloud reconstruction from tri-stereo satellite images. Section 2.2 gives an overview of how deep learning is currently used and studied in remote sensing. Section 2.3 focuses on DL on point clouds. Finally, Section 2.4 discusses reproducibility in science, especially in machine learning, and the importance of choosing baselines.

3D Reconstruction from Satellite Images
The photogrammetric workflow including the 3D reconstruction was performed using the Match-AT and Match-T Digital Surface Model (DSM) modules of the Trimble Inpho software [109]. The software is designed to perform precise image block triangulation, point cloud reconstruction, and orthophoto generation. Overall, the image processing chain consists of the following main steps: (1) import of the images, Rational Polynomial Coefficients (RPCs), and Ground Control Points (GCPs), followed by image pyramids generation; (2) identification of GCPs locations within the images; (3) RPCs refinement based on the GCPs and automatically extracted tie points; (4) dense image matching for 3D reconstruction; and (5) point cloud interpolation for DSM derivation. Finally, the DSM is used for orthorectification of a Nadir image, which is further considered as input data in the FCN-8s approach. The software is highly automated, and hence the user is limited to a few adjusting steps, e.g., the manual identification of GCPs in images and correction of points with high residuals. For dense image matching two strategies were employed [110]: Feature Based Matching (FBM) and Cost Based Matching (CBM). The output of dense image matching are 3D point clouds, which are further used as input for DSM derivation.

Deep Learning on Satellite Images
DL has been extensively applied on remote-sensing imagery in recent years [57]. Hu et al. investigated the generalization properties of CNN extracted features from ImageNet to high-resolution remote sensing images for scene classification [111]. In that regard, the authors proposed two scenarios for extracting features: one for extracting features from fully-connected layers, and one for extracting features from convolutional layers. The extracted features are then used as an input for a linear Support Vector Machine (SVM). This linear classifier is evaluated on the UC Merced Land Use Dataset and the WHU-RS dataset. According to the paper, the linear SVM outperforms current state-of-the art hand-crafted feature approaches by a large margin, indicating that the features learned by the CNN have better generalization properties, even when the features are learned from a different domain, in this case from ImageNet natural images to the high-resolution remote sensing images from UC Merced Land Use and WHU-RS datasets. Sun et al. explored the use of encoder-decoder architectures for semantic image segmentation [112]. Bischke et al. applied DL to aerial imagery and addressed the problem of sharp object boundaries by proposing a cascaded multi-task loss [113]. Similarly, Yi et al. used a U-Net architecture with Res-Net blocks for urban building segmentation in VHR remote sensing imagery [114]. Guo et al. also segmented buildings but incorporate superresolution in their approach, which shows promising results as the authors report an improvement of roughly 19% for the Jaccard index as well as the kappa score [115]. Eerapu et al. proposed a dense refinement residual network (DRR Net) for road extraction in aerial imagery [116]. In addition, the authors trained their network on a composite loss function surface to place emphasis on small objects. Marmanis et al. studied the use of an ensemble of CNN for semantic segmentation [117].
Li et al. used three-dimensional convolutions for Hyperspectral Image (HSI) semantic segmentation [118]. CNN architectures that use three-dimensional convolutions are called 3D-CNN. They are different from 2D-CNN, as 3D-CNN convolve the spatial and the spectral dimension simultaneously with three-dimensional kernels, while 2D-CNN use two-dimensional kernels. Sellami et al. also trained a 3D-CNN on semantic segmentation and proposed a dimensionality reduction technique for adaptively selecting spectral bands before training the 3D-CNN [119].

Deep Learning on Point Clouds
Applying DL on point clouds is not straightforward since point clouds do not have twodimensional grid structures, which is assumed by standard convolutional neural networks. In addition, images have a uniform density distribution, but point clouds do not. Therefore, it is not a surprise that previous research in remote sensing used other machine learning techniques, such as decision trees [98] and random forests [120]. Some researchers focus on developing features (i.e., feature engineering) to boost the performance of these machine learning techniques [121]. However, in CV, many researchers propose techniques that map the original point cloud to a two-dimensional grid structure as an alternative. Zhou et al. proposed dividing the space into three-dimensional voxels, and, for each of these voxels, a feature representation is learned by a voxel feature encoding layer [122]. This leads to a grid structure of features and hence allows the use of standard convolutional layers. Qi et al. proposed a network that learns on unordered point sets by applying symmetric functions [60]. Since the ordering in a set is of no importance, their solution does not assume that neighbors in the input tensor are actually neighbors in the point cloud. However, recent work focuses on the convolution operation itself by altering it such that it can directly be applied on the point cloud. Hua et al. proposed a pointwise convolution which queries the neighborhood of a point on the fly, bins the neighbors into kernel cells, and then applies the convolution operation [123]. While this operation is promising, it is currently difficult to apply on large scale as the current implementation is not optimized.
Vo et al. presented the results of the 2015 Institute of Electrical and Electronics Engineers (IEEE) data fusion contest where participants used Light Detection And Ranging (LiDAR) and RGB data simultaneously [124]. For road detection, Vo et al. applied decision trees on hand crafted features, aerial laser scanning and imagery is the basis for the hand created features [125]. This is similar to our work from the data perspective, because we also investigate the combination of geometric and color information for machine learning in remote sensing. However, our contribution is significantly different for three reasons. First, our geometric information is substantially inferior to that of airborne or terrestrial LiDAR, since it is derived from stereo matching applied to tri-stereo satellite imagery at 0.7 m Ground Sample Distance (GSD). Second, we use DL. Third, we apply DL directly on the point cloud without any preprocessing step, e.g., using voxels.

Reproducibility and Baseline (Scientific Approach in ML)
Machine Learning (see Maxwell et al. [126] for a practical review of machine learning in remote sensing), and especially DL, has made substantial progress over the past decade. Adding to the examples in the Introduction, Silfer et al. used reinforcement learning, an approach to learning that uses the reward and punishment paradigm [127], to master the game of Go [128]. DL was a crucial part of the work of Nadell et al., as they used it to discover metamaterial designs for energy harvesting [129]. The authors reported an average mean squared error of 1.16 × 10 −3 . In addition, the neural network is over five orders of magnitude faster than conventional electromagnetic simulation software. In medicine, the recent work of Stroke et al. shows that DL can discover antibiotics which are structurally divergent (in this case, e.g., halicin) from conventional antibiotics [130]. The researchers tested halicins bacterial activity and found it to work against tuberculosis and strains considered untreatable. However, scholars working in the field have recently pointed out worrying trends [131][132][133]. Dacrema et al. analyzed the progress of recent neural recommendation approaches [134]. They reported that only 7 out of 18 examined neural recommendation approaches can be reproduced with reasonable effort. In addition, six of these seven can be outperformed with simple heuristic methods such as nearest-neighbor. Fu et al. argued similarly, as they claimed that combining SVM with differential evolution (a method for tuning hyper parameters) achieves similar or sometimes better results than DL while having a significantly lower training time (10 min vs. 14 h) [135]. While the observation of a lack of reproducibility in machine learning, and deep learning specifically, is worrying, it is not unique to this field. Other research fields suffer from the same issue. For example, a reproducibility test conducted in Psychology revealed that over 50% of studies in the field are not reproducible, and two thirds of them should be distrusted [136]. Moreover, a survey in 2016 revealed that 70% of researchers fail to reproduce another scientists experiments and more than 50% fail to reproduce even their own experiments [137]. Gannot et al. gave an equally worrying overview of reproducibility and transparency in biomedical sciences [138]. This controversy invoked a discussion about the question "What does research reproducibility mean?" [139] and a discussion about how scientists can avoid fooling themselves [140]. Menke et al. used SciScore (https://www.sciscore.com/), a software tool that automatically scores a research paper on a ten point scale, to analyze 1.6 million PubMed Central papers [141,142]. They also used a Rigor and Transparency Index, which is the average score of analyzed papers in a particular journal. They found that fewer than half of the analyzed research papers address rigor and reproducibility criteria. Furthermore, perhaps surprisingly, their Rigor and Transparency Index did not correlate with the impact factors of journals. In summary, reproducing DL related research takes a sizeable amount of resources, if it is possible at all based on the information given in that research article, and the way baselines are chosen is questionable. We present our approach to address and resolve these issues in the next section, specifically in Section 3.2.

Materials and Methods
In this section, we describe how we derive the point clouds and the orthophoto from tri-stereo satellite images, the study site and the dataset, our approach for semantic segmentation in two dimensions, and our approach for semantic segmentation in three dimensions. In Section 3.1, we describe the study site and the dataset, followed by Section 3.2 where we approach to ensure reproducibility and describe our approach for choosing baselines. Finally, in Section 3.3, we present the generalization of the convolution operation and the used U-Net based architecture.

Study Site and Pléiades Dataset
Pléiades is a European VHR satellite system comprising two identical satellites, Pléiades 1A and Pléiades 1B, which were developed by Airbus Defence and Space for both civil and military uses. With a daily revisit capacity and a swath width of 20 km, the sensor can reach a ground resolution of 0.7 m (in panchromatic mode) from an orbit height of 674 km. Due to its fast rotation, the sensor is able to record three images within the same orbit in less than 1 min. The images provided for this work were acquired in the tri-stereo mode, from different along-track positions of the Pléiades-1B sensor: forward (FW)-, close to nadir (N)-, and backward (BW)-view.
The study area is located in Waldviertel, a region in Lower Austria, which is the northeastern state of the country (48 • 30'30" N; 15 • 08'34" E; WGS84) ( Figure 2). With elevations ranging from a minimum of 572 m above see level (a.s.l.) to a maximum of 749 m a.s.l., the territory is a hilly region covered mainly by forests, agriculture and grasslands. Rural, urban, and water surfaces are also present, but with a lower frequency of occurrence. For this study site, Pléiades images were acquired in the late morning of June 13th 2017. in the north-south direction, within less than 1 min. Therefore, they have the same illumination conditions, and shadow changes are not significant. Images were delivered as pan-sharpened with four bands (Red, Green, Blue, and Near-infrared) and spatial resolutions varying between 0.70 and 0.71 m, depending on the different viewing angles. The images were distributed as a primary product at sensor processing level, i.e., they contain corrections for sensor geometry and radiometric distortion, viewing angles, attitude, and ephemeris data. Detailed information about the acquisition properties of the tri-stereo pair are summarized in Table 1.
For each image, auxiliary data including the RPCs were provided by the supplier. To obtain sub-meter geolocation accuracy, the RPCs were refined by measuring 60 GCPs, i.e., points with known 3D coordinates, within the images. A Digital Terrain Model (DTM) generated from a LiDAR acquisition (December 2015) with 1 m spatial resolution was used for vertical quality assessment and for improving the absolute geolocation of the photogrammetrically derived DSM. We used a digital orthophoto from 2017 at 0.20 m spatial resolution for manually defining the planar locations of Check Points (CPs), whose corresponding heights were extracted from the LiDAR DTM. For this study, we selected a smaller test area (44.5 km 2 ) on which we trained and evaluated the generalized sparse convolutional neural network ( Figure 2).  The characteristics for the point cloud dataset are as follows. The whole dataset consists of 196,495,815 points. We followed the established rule of thumb in machine learning, where one splits the dataset into two parts: one part is roughly 80% of the total dataset and the other is roughly 20%. The former is our training dataset, and the latter our test dataset. Based on two polygon regions, the total number of points in the training set is 163,013,266 (82.96%), while the number of points in the test is 33,482,549 (17.03%). The same polygon regions were used to split the orthophoto as well. As shown in Figures 2 and 3, the class distribution is approximately the same for the training set as well as the test set. For the class weights experiment, we used median class weighting. This weighting approach weights each classes frequency f req c in relation to the median of the class frequencies median( f req c ), in our case to 2,769,222, the occurrence of roads (Table 2). Mathematically, this is written as These class weights hence downgrade the importance of classes with a high occurrence, and upgrades the importance of classes with a low occurrence. For example, for clutter, this equation is equal to 2769222 98718944 = 0.02, and, for vehicles, to 329.90. The level of gemetric distortion is clearly displayed in Figure 4, and Figure 5 shows the distribution for each feature.

Our Approach for Reproducibility and Choosing Baselines
In line with the reproducibility discussion and to address the problem of reproducibility, we clearly state that our study falls into the results reproducibility category because our data are private, as of this writing. Results reproducibility means, in the context of this paper, that one gets approximately the same results when using approximately the same data and the same training procedure. That said, we acknowledge the fact that acquiring approximately the same dataset requires sizeable amount of resources. However, we are actively working on open sourcing our data in the near future.
To ensure reproducibility and comparability of our GSCNN experiments, we took two actions. The first action is to set seeds for random number generators. This means that, in our case, we set seeds for the Python libraries numpy, as well as for PyTorch with numpy.seed(0) and torch.manual_seed(0), respectively. The second action was to set cuDNN in the deterministic mode with torch.backends.cudnn.deterministic = True and torch.backends.cudnn.benchmark = False. This ensures that cuDNN only uses deterministic functionality and, further, that cuDNN does not automatically choose the best algorithm.
To further address the issues and facilitate research and ease reproducibility, we made all 3D models after each epoch public on PyTorch Hub. Specifically, we open-sourced all obtained network parameters after each epoch for all cases, which totals 500 sets of network parameters (50 from the coordinates experiment, 50 from the coordinates and color experiment, and 400 from the coordinates and color and class weights experiment). This decreases the amount of resources researchers, or others, need to allocate if they want to use or validate our research. For example, if other researchers also work on three-dimensional semantic segmentation with their own dataset and they would like to compare their approach with our method, they only need to download the weights instead of going through the training process themselves. Additionally, we see this as a first step towards transfer learning [7] in the domain of semantic segmentation on derived point clouds from satellite imagery with convolutional neural networks. For further information on system requirements, installation process, and how the weights can be downloaded, please visit the GitHub project (https://github.com/MacOS/ReKlaSat-3D/) for this paper.
As pointed out in Section 2.4, establishing a simple and strong baseline is important. For this reason, we established a strong baseline for the test set in general and for each class. For the overall accuracy, we set 60.56% as the first naive mark for the training set, because we can always achieve that by predicting clutter. We achieve an even better overall accuracy for the test set with that approach (64.25%) ( Table 3). This is marked as baseline A. While this prediction method is not useful in practice, it is certainly a good reference point and highlights shortcomings of the respective performance metrics. Additionally, we compared the results with a decision tree approach, a supervised-feature learning algorithm [143]. In particular, we used the approach described by Waldhauser et al. [98]. The list of hand-crafted geometric features include, inter alia, the normal vector including its standard deviation (NormalX, NormalY, NormalZ, and NormalSigma), a local penetration rate (EchoRatio), statistical measures of the local vertical point distribution (ZRank and ZRange), and features derived from the structure tensor T (linearity, planarity, and omnivariance) [144]. For the training phase, we used 20% of the training data for training and the rest for cross validation. For the computation of the hand-crafted geometric features, Opals (Orientation and Processing of Airborne Laser Scanning Data) software is used [108]. We trained the decision tree with the following hyperparameters: (i) 0.00001 complexity factor; (ii) to attempt a split, a minimum of 20 observations has to exist in a node; (iii) each leaf node has at least seven observations; (iv) a maximum depth of 30 for the tree; (v) five competitor splits; and (vi) five surrogate splits. Our hardware set up for this is an Intel Core i7-4770K CPU @ 3.50 GHz, 32.0 GB RAM. Computing the features with this configuration takes 20 h 10 min, while the training time is 7 h 25 min for the geometric features and 9 h 22 min for the geometric and color features. The validation process takes 21 and 23 min for geometric and geometric and color features, respectively. More information on this approach is described in Appendix C.
We used FCN-8s, a FCNN with the standard convolution operation for semantic segmentation of the orthophoto, as proposed by Long et al. [90]. This network architecture combines features from three different levels of granularity for the prediction step. The first level of granularity is from the last layer, hence these features have the lowest level of granularity. These features have to be upsampled with a stride of 32. The second level of granularity are the features of the fourth out of five pool layers, and require to be upsampled with a stride of 16. Finally, the layer with the highest level of granularity to be used for prediction are the features from the third pool layer. These features are upsampled with a stride of 8, where the name FCN-8s originates from. Combining features with different levels of granularity improves the quality of segmentation, e.g., with sharper object boundaries. We used this network to assign a label to each pixel in the image I. If N and M denote the height and width of an Image in pixels I, and C the number of color channels for each pixel, then I can be interpreted as a tensor I with dimensions N, M, C, i.e., I NxMxC . The networks input features are red, green, and blue for each pixel. Consequently, we searched for φ : I NxMx3 → [clutter, road, building, vegetation, vehicle] NxM . We trained the network with a fixed learning rate of 0.0001 for 50 epochs with Caffe [106]. Appendix D and Figure A3 present a more detailed description of this network.

Generalized Sparse Convolutional Neural Networks for Derived Point Clouds
In this work, we are interested in the feature extraction capabilities of the generalized sparse convolution operation, as proposed by Choy et al. [91]. Generalized sparse convolution generalizes convolution for generic input and output coordinates, i.e., this operation does not have the implicit grid assumption of the standard convolution operation. In addition, it allows defining arbitrary kernel shapes. The conventional dense convolution in D-Dimensions is defined as: is the list of offsets in a D-dimensional hypercube. Choy et al. relaxed this equation by allowing arbitrary kernel shapes using predefined input and output coordinates, C in and C out . This leads to: for u ∈ C out N D is a set that defines the offsets of a kernel at kernel center u, i.e., arbitrary kernel shapes can be defined. This set is restricted to N D (u, i.e., the current coordinate center plus the offset has to be a coordinate in the input coordinates. Trivially, the offset needs to be defined. We use this formulation as a drop in replacement for the conventional dense convolution. The network architecture we use in this paper is U-Net based [92] (Figure 6). A variant of this architecture is, at the time of this writing, state-of-the-art on the ScanNet dataset [145]. This network architecture only uses hypercubes as its kernel shape. The circled plus represents element wise tensor addition, and concatenate takes two tensors as an input and appends the two tensors along the channel dimension. For example, the first concatenate from the top receives two tensors as an input: the output tensor of the first sparse convolution down-sampling layer with 32 feature channels via the skip connection and the output tensor of the last sparse transpose convolutional layer with 96 feature channels. Hence, the output of concatenate is a tensor with 32 + 96 = 128 feature channels. Numbers outside of rectangles indicate the number of feature channels that go in and out. For example, the first layer expects C input channels and outputs 32 channels. All sparse convolutional layers are followed by batch normalization, and, in addition, all have a dilation of 1. A sparse convolution layer has a stride of 1, if not stated otherwise. For a better understanding, we merge layers into the four buildings blocks: SparseConv ResNet Block (green), SparseConv ResNet Block 2 (blue), SpraseConv Downsampling (yellow), and SpraseConv Transpose (red). A SparseConv ResNet Block applies a sparse convolution, with kernel size 3, two times on the input tensor. The resulting tensor is then combined with the input tensor with element wise tensor addition before the ReLU operation. SparseConv ResNet Block 2 deviates from this in one way: it doubles the number of feature channels it receives in the first sparse convolutional layer. We therefore need an extra sparse convolutional layer, with kernel size 1, in the side stream before we can perform element wise tensor addition. SparseConv Downsampling is a sparse convolutional layer with kernel size 2, and, additionally, a stride of 2, which halves the feature maps, i.e., if the input is N points, it outputs N/2. This operation is reversed by SparseConv Transpose with the same hyper parameters. We describe the encoder (also down path) and the decoder (also up path) in the next two paragraphs, respectively.  The encoder, generally, takes the input tensor and down-samples it, while it doubles the feature maps. First, the network performs a sparse convolutional operation on the provided features of the input point cloud with kernel size 5 with 32 kernels, expanding the number of feature channels from C to 32. The kernel size of 5 allows the network to capture patterns in a larger neighborhood. Next, SparseConv Downsampling halves the size of the feature maps. This pattern, i.e., first downsampling the feature maps and then doubling the number of feature maps is repeated four times in total, with a growing number of ResNet Blocks between them: specifically, first by 2, then by 3, and then by 4. The last doubling of feature maps takes place in the first layer of the bottleneck (last blue rectangle in the encoder) which consumes 128 and outputs 256. This is followed by five SparseConv ResNet Blocks. Hence, the bottleneck consists of six ResNet blocks.
The decoder, generally, takes a tensor and upsamples it, while it halves the number of feature maps. The decoder's first layer is the first SparseConv Transpose layer from the bottom, it doubles the size of the tensor it receives while keeping the number of feature maps at 256. The resulting tensor is then concatenated with the output tensor of the encoder's last SparseConv Downsampling layer, which gives us a total of 384 = 256 + 128. Following that, the first SparseConv ResNet Block 2 in the decoder reduces this to 256. Next, a SparseConv ResNet Block operates on this tensor, and subsequently a SparseConv Transpose layer halves the number of feature maps down to 128. This sequence, concatenate, SparseConv ResNet Block 2, SparseConv ResNet Block, and SparseConv Transpose, is then repeated until the prediction layer. Note that the last SparseConv Transpose layer does not halve the number of feature channels.
We assign a label to each point in a point cloud, i.e., we search for φ : F N → [clutter, road, building, vegetation, vehicles] N , where F is the points feature tensor and N is the number of points in the point cloud. Note that we differentiate between a point and how a point is represented, i.e., their features f . Consequently, we also need a tensor that keeps track of the coordinates, which allows us to monitor the positions in a three-dimensional space. In the first experiment, a point was represented by its geometric features f i = [x i , y i , z i ] and in the second experiment also by its color f i = [x i , y i , z i , re i , gr i , bl i ]. Mathematically, we can write these two tensors as two matrices: one matrix C for the coordinates and one matrix F for the features. When we use geometry and color as features, these matrices are as follows: The b i denote the batch index a point belongs to. All features were normalized, i.e., min-max scaled, or more formally ∀ f ∈ F, f ∈ [0, 1]. We trained the network for 50 epochs with batch size 3. We optimized the parameters with RMSprop [146] and a learning rate of 0.001. Our hardware set up is 16 GB DIMM DDR 1333 MHz RAM, Intel Core i5-2400 with 3.1 GHz CPU, NVIDIA GTX 1080 Ti with 12 GB GPU, and as a motherboard Dell 0HY9JP version A00. With this configuration, one training loop takes approximately 3 h 45 min, and one evaluation loop approximately 45 min.

Results
We present our results in the following two subsections. Section 4.1 describes the results for dense image matching and 3D reconstruction. We then present (and compare) in Section 4.2 the segmentation results for the U-Net based GSCNNs, the decision tree, and for the FCN-8s (in this order).
We also describe the results for the GSCNN model that we trained for 400 epochs with class weights. Finally, we evaluate the overall performance of the models, as well as the class level performance.

Dense Image Matching and 3D Reconstruction Results
Image orientation and dense matching were performed for all three images covering the study area. The automatic tie point extraction identified approximately 552 points, with standard deviations between 0.38 and 0.52 pixels, while the total standard deviation of the bundle adjustment resulted in a value of 0.54 pixels. The dense image matching process was performed at full image resolution. Hence, for each (matched) pixel, a 3D object point was generated, whose coordinates were calculated by spatial forward intersections, finally resulting in a large photogrammetric point cloud with a high density of 4 points/m 2 (Appendix A Figure A1). Compared with stereo, the use of tri-stereo images results in a higher quality 3D reconstruction, since three-acquisition-direction reduces the risk of occlusions. Nevertheless, difficulties are encountered in regions with poor or repetitive texture, such as water surface or shaded areas, where the final 3D point cloud contains higher noise and outliers. The resulting point cloud describes the terrain surface in the open, free areas (agriculture and grasslands) and the top surface of the objects on it (buildings and trees). The described surface is smooth, with bevelled building edges and missing height information for small individual objects. For example, points corresponding to small vehicles (family cars with lengths smaller than 5 m and heights smaller than 1.5 m) are not elevated compared to their surroundings. Hence, a classification using solely geometry information cannot reliably detect such objects.
For the quality assessment, the point cloud was interpolated into a regular elevation model with 0.5 m resolution, i.e., DSM, by using a robust moving planes interpolation method. Therefore, the vertical Root Mean Square Errors (RMSEs) between the DSM and 50 homogeneously distributed CPs as well as the reference LiDAR DTM was computed (the comparison with the LiDAR DTM was only done in open, free areas, where the DSM can be considered the same with the DTM). Additionally, the distribution of the height difference errors was analyzed by computing statistic measures such as mean, standard deviation (σ), median, and a robust standard deviation based on the median of absolute differences (σ MAD ). The height differences between the computed DSM and the LiDAR DTM in open areas show a systematic positive difference (~1 m) in the southern part of the scene (Appendix A Figure A2). Such systematic error can be removed by applying Least Square Matching (LSM) [147] as demonstrated in Loghin et al. [148], where height differences are reduced to a median close to zero and the total Root Mean Square Error (RMSE) decreases from 0.96 m to 0.61 m (Appendix A Table A1). For the 50 CPs, the RMSE dropped from 0.31 to 0.25 m after LSM, which is very close to one-third of the pixel size (0.35 pixel).
For completeness, it should be mentioned that GCPs are not mandatory for processing satellite imagery. Based on the original RPCs and automatically measured tie points, a DSM can be derived as well. However, the geolocation accuracy of such model is poor, as in our case we observed a positive offset of approximately 18 m. Applying a LSM transformation as described before, similar accuracies can be achieved, as with the usage of GCPs.

Semantic Segmentation Results
In this section, we describe and compare the performance of the trained models. We start with the GSCNN models using class weights based on 50 epochs. Then, we proceed with the decision trees, and follow up with FCN-8s. Finally, we describe the GSCNN model that uses class weights over the 400 epochs. Before that, we give a brief rationale on the chosen performance metrics.
We compare the performance of the machine learning models with metrics that do not take the number of classes into account, and with metrics that do. We do this for many reasons, among them are the following three. The first is to get a more detailed picture on how a particular model actually performs. This is especially important when one or two classes dominate. Second, and tied to the first, performance metrics that do not take the number of classes into account put, a mostly unjustifiable, high importance on the dominant classes. In this paper, for example, adding color as features to the coordinates does improve the performance of the GSCNN model judged by the kappa score and the overall accuracy. However, it decreases the performance judged by metrics that take the number of classes into account, e.g., the average class accuracy. Third, in some cases, we are interested in the prediction performance of a particular class. An organization that is in the oil pipeline surveillance business is generally more interested in vehicles than in buildings. For these reasons, we compare the performance with four metrics that take the number of classes into account by averaging and with two metrics that do not. We also break the performance down to the class level for accuracy, precision, recall, and F1.
Adding color information to geometry does increase the kappa score and the overall accuracy of our U-Net based GSCNN (Table 3). In addition, the color information has increased the class metrics for clutter on accuracy, recall, and F1, while it has decreased the precision (Table 4). In fact, the increase in kappa score and overall accuracy is solely due to the increased performance on predicting clutter. However, this increase has a negative effect on all other metrics. In general, we see lower values for average precision, average recall, average F1, and average accuracy. The same is true for the per class metrics, with the exception of the aforementioned clutter class. The two dominant classes clutter and trees are overwhelmingly present in false positives and false negatives (Tables A3 and A5). Interestingly, adding color features does not have a homogeneous reducing effect on this. For clutter, the number of times tree is predicted increases from 3.6 million to 10.2 million. For trees, we observe the opposite effect as the number of times clutter is predicted, decreasing from 9.3 million to 1.4 million. Adding class weights to the color and coordinates model generally reverses the described effects, albeit not completely, with the roads class being the exception as it increases the performance. This means that, if a value decreases with the additional color features, it increases with the class weights, and vice a versa. In Table 3, for example, the weighted loss increases average precision, average recall, average F1, and average accuracy while it decreases the kappa score and overall accuracy. This effect is also present in Table 4, and when we compare Tables A5 and A7. These decreases can be attributed to the effects of the weighted loss, as the weights decrease the impact of classes with a high occurrence and increase the impact of classes with a low occurrence on the network's loss. Adding color to the coordinates leads to a higher stability over the training process (compare Figures 7 and 8), while median class weights seem to introduce instability again (compare Figures 9 and 10). Table 3. Overall quantitative comparison of the GSCNN, FCN-8s, and the decision tree. We report (from left to right) average precision, average recall, average F1 score, average accuracy, kappa score, overall accuracy, and MCC. Values in bold format are the highest numbers for the corresponding metrics, with the exception for baseline A. See Table 4 for per class results, and Appendix B for the confusion matrices. Note that we abbreviate weighted loss with (W.L).

Models
Avg The additional color information boosts the decision tree's performance on all classes, but one, among all metrics (Table 4). Adding color information has no effect on the vehicle class. The per class accuracy improved from 93% to 95% for clutter, from 88% to 93% for trees , from 11% to 79% for buildings, and from 0% to 24% for roads. This 24% for roads is the best performance among the three approaches. While two of these improvements are dramatic, they only lead to a modest improvement for the kappa score and the overall accuracy, due to the class imbalance. However, the average per class accuracy improves by 20% from 38% to 58%, which is only 3% lower than the average per class accuracy of FCN-8s. The additional color information also leads to the best precision and F1 score for buildings and the best recall for roads. Hence, the decision tree outperforms FCN-8s on these cases, when simultaneously trained on both hand-crafted geometric and color features. When only trained on the hand-crafted geometric features, the decision tree suffers from the dominance in false positives and false negatives of the two classes clutter and trees as the GSCNN, although not as strongly as the GSCNN (Table A13). However, the additional color information partially mitigates this (Table A5). Overall Accuracy (OA) on the Test for GSCNN Models after each Training Epoch.  FCN-8s performs the best among the three approaches in our study area, with an average precision of 62%, an average recall of 61%, an average F1 score of 59%, a kappa score of 90%, an overall accuracy of 96%, and an average per class accuracy of 61% (Table 3). However, the performance of the decision tree when color is included is only 1-3% lower for these metrics. FCN-8s confusion matrix shows that it does not suffer as strongly as the other two from the class imbalance (Table A8). For example, for clutter, the second biggest group of false positives is roads and not trees. FCN-8s also leads in most of the per class metrics (Table 4); it is outperformed by the decision tree when it is trained on hand-crafted geometric features and color on four occasions. First, the decision tree has a higher accuracy for predicting roads. Second, the decision tree has a higher precision for predicting buildings. Third, the decision tree has a higher recall for predicting roads. Fourth, the decision tree has a higher F1 score for buildings.    Training the GSCNN on color and geometry, with a weighted loss, mitigates the class imbalance problem ( Figure 10 and Table 4). The network commonly has accuracies different from 0 for buildings and vehicles between 100 and 250 epochs. However, after the 250th epoch, they become less and less frequent and smaller. The volatility of the overall accuracy, measured by the standard deviation σ, is five times higher than the coordinates model. In particular, σ = 10.22%, which is similar to the coordinates model σ = 9.99%, while, for the model that uses coordinates and color, we have σ = 2.02%. MCC almost completely recovered after 400 training epochs to 0.17, which is almost twice as high than without colors. In other words, it almost completely recovered.

Discussion
In this section, we analyze the 3D reconstruction result in the first paragraph, and interpret the semantic segmentation outcome in the context of CV in the second paragraph. Finally, we outline future research directions.
As described in Section 4.1, the accuracy of the computed DSM was analyzed based on 50 homogeneously distributed CPs. We observed a vertical RMSE of 0.25 m. Our result is therefore better than the accuracies of 0.49 m reported by Bernard et al. [149] and of 0.5 m obtained by Hu et al. [150] for 6001 ground LiDAR check points. We picked CPs on flat, smooth, and non-vegetated areas, whereas, at larger scale, the surface land cover type [151] will influence accuracy assessments. Hence, we compared the Pléiades DSM with the reference LiDAR DTM in the open, stable areas, only. However, we did not explicitly exclude low vegetation and agriculture fields, making the larger global RMSE of 0.61 m (0.9 GSD) plausible. Nevertheless, the result corresponds to reported vertical accuracies of 0.44 and 2 GSD for photogrammetrically derived elevation models captured from aerial platforms [151][152][153].
DL advanced a variety of tasks in NLP and CV, especially in cases where the problem can be formulated as a set of features that can be arranged in a grid structure, e.g., images. Recently, this assumption of a two-dimensional grid is challenged by the machine learning community, as more effort is put into generalizing the convolution operation, e.g., with the previously mentioned pointwise convolutional operation [123], and the generalized sparse convolution operation we use in this paper [91]. While the GSCNN is outperformed by both models, we argue that our results indicate that the generalized sparse convolution operation is a promising candidate as an approach, especially when we acknowledge the fact that learning geometric patterns in a three-dimensional space seems to be a harder learning problem, as indicated by our experiments, which is particularly true for derived point clouds from tri-stereo imagery as the geometric information quality of objects is lower. That leads to difficult decision boundaries, as these boundaries are diffused. Additionally, we have to acknowledge that learning in a higher-dimensional space, 3 versus 6, is per se more difficult due to the curse of dimensionality [154]. Our experiments also indicate that median class weighting improves the performance on rare classes, even with a factor of 1:10.000. However, it requires 2-3 times more epochs (and hence a longer training time) for these performance improvements to be visible in out setting. We want to highlight that data augmentation might drastically improve the performance of the tested model as this is a widely observed behavior in DL. Class imbalance strategies are also likely to greatly boost the performance of DL in our case, e.g., the approach presented by Messay-Kebede et al. greatly boosts the accuracy from 40.5% to 81.0% on class Simda, where only 42 samples are present [155].
Future research directions include the use of DL for 3D reconstruction. For example, Saito et al. presented a technique that takes one or more images as an input, and outputs a 3D reconstruction of an object [156]. The findings of this research direction might give valuable input for improvement of the currently used hand crafted mathematical models that we use in this paper. The applicability of superresolution on the derived point clouds is also worthy of future investigation, similar to what was done in the two-dimensional grid world with images by, e.g., Kim et al. [157]. Not only might this highlight areas for improvement of the mathematical models, but it might also lead to less distorted point clouds, which in turn can significantly improve not only the segmentation results but also other tasks, e.g., object detection. Another research direction is to conduct a comparison study, e.g., comparing our approach with pointwise convolutional networks [123] and pointnet based networks [60]. Investigating the uncertainty of the model is also an interesting and important avenue for further research [158]. Findings from this research help understand in which scenarios the model is likely to be wrong, which, in turn, would lead to strategies that limit and mitigate these scenarios. Information about uncertainty is important for deploying and incorporating machine learning models into business processes (see above). An ablation study on the network's components is also a direction. For example, systematically decreasing the number of filters gives insight to the question of model complexity. Finally, a study on different class imbalance strategies will contribute to answering the question which strategies will improve the performance in settings such as ours.

Conclusions
We studied the utility of point clouds that are derived from tri-stereo satellite imagery for semantic segmentation with a U-Net based architecture that uses generalized sparse convolutions. We used an Austrian study area. For this study area, we reported that, ceteris paribus, color and geometry (from derived point clouds) do not lead to an increase in general performance for the GSCNN. In our experiments, we found that, inter alia, adding color to geometry leads to a higher performance only on the dominant class. We also found that median class weighting partially reverts the effects of adding color. For example, MCC drops from 0.18 to 0.00 when color is added, and it raises to 0.09 with median class weighting in our experiments. The network also started to learn the classes with lower occurrences. In addition, FCN-8s and the decision tree outperform the GSCNN. Providing the GSCNN with the additional color information does increase the overall accuracy from 56% to 62%, and the kappa score from 39% to 45%, but this is at the expense of a significantly lower per class accuracy for the decision tree. The accuracy performance for roads drops from a modest 2.25% to 0.00%. This leads to a fall of the average per class accuracy from 24% to 19%. For the decision tree, adding color information to the geometry leads to a significantly higher performance. The kappa score increases from 82% to 86%, the overall accuracy from 89% to 93%, and the average per class accuracy from 38% to 58%. Adding color has a particularly strong effect on the class accuracy performance for roads and buildings. The highest performance on this study area has FCN-8s with a kappa score of 90%, an overall accuracy of 96%, and an average per call accuracy of 61%. This network is only outperformed by the decision tree as the latter has a 2% higher accuracy for roads prediction. While the GSCNN is outperformed by and large by the decision tree, it has major shortcomings. One is the longer prediction phase. With their respective hardware settings, the GSCNN takes 45 min for the test dataset, while computing the features for the decision tree takes approximately 4 h with our hardware set up. Additionally, one has to set up the infrastructure for this. We contribute in multiple ways with our research. Firstly, we open up a new possibility for data fusion in remote sensing, specifically in the case of applying machine learning [159]. To the best of our knowledge, this is the first work that studies point clouds derived from tri-stereo imagery for DL approaches. Second, similarly, our work extends the semantic segmentation literature to satellite derived point clouds. Third, we contribute to transfer learning, by making our class weights for the GSCNN networks open source on PyTorch Hub via https://github.com/MacOS/ReKlaSat-3D. Additionally, it makes our research more accessible to other researchers by drastically decreasing the amount of resources necessary to use our results. We also make the learned decision trees publicly available through this GitHub repository.

Conflicts of Interest:
Author Nikolaus Schiller was employed by the company Vermessung Schmid ZT GmbH. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript:  Figure A1 illustrates the results of image matching, with an overview of the reconstructed point cloud together with color and shaded DSM visualizations for the study area. Figure A2 shows the color-coded elevation differences between the photogrammetric-derived DSM and the reference LiDAR DTM with their frequency distribution histogram before and after applying LSM transformation.

Appendix B
In this appendix section we depict the confusion matrices for all machine learning methods and each set of input features. Tables A3 and A5 show the confusion matrices for the GSCNN for the input features coordinates, and the input features coordinates and colors, respectively. Table A8 shows the confusion matrix for the two-dimensional CNN approach. Finally, the confusion matrices for the decision tree trained on coordinates, and trained on coordinates and colors are shown in Tables A12 and A13, respectively. Please note that the numbers in Tables A3 and A5 do not add up the numbers  presented in Table 2, because the current implementation of the Minkowski Engine cannot process point clouds with duplicate points. For example, it does not allow that f 1 = [2, 2, 2] and f 2 = [2, 2,2] are in the same point cloud pl, i.e., if f 1 ∈ pl =⇒ f 2 / ∈ pl and if f 2 ∈ pl =⇒ f 1 / ∈ pl. We had to delete 859 points for this reason. The numbers in Table A8 do not add up to Table 2, since the numbers reflect pixels and not points. The normalized confusion matrices are normalized along the rows, i.e., along the actual (also true) classes.

Appendix C
As a pre-processing step, additional geometric features were computed for each 3D point in the matched cloud. To this point, we considered for the spatial query of a point neighborhood an infinite cylinder with 7 m radius. These features include, e.g., the normal vector (NormalX, NormalY, NormalZ, NormalSigma), EchoRatio, ZRank, ZRange, and features derived from the structure tensor T (linearity, planarity, and omnivariance). All these features describe the point distribution [144] and are required for the separability of the classes. Surface normals are an important geometric property for the description of a surface. The local tangent plane is estimated by computing the best fitting plane for the nearest points and its corresponding normal vectors (termed NormalX, NormalY, and NormalZ) together with the standard deviation of the fit are used as additional descriptions of the points (NormalSigma). The distribution of the points in the neighborhood is derived from the structure tensor T, from which three main features are extracted: linearity, planarity, and omnivariance. The linearity feature is used to characterize 3D line objects such as power lines, planarity is a feature describing the smoothness of the surface, and omnivariance gives information about the volumetric point distribution. EchoRatio is a measure that describes the vertical point distribution, ZRange represents the maximum height difference between the points in the neighborhood, while ZRank is the rank of the point corresponding to its height in the neighborhood [160].

Appendix D
We describe FCN-8s as introduced by Long et al. in greater detail in this appendix section ( Figure A3). Neural network layers are depicted in rectangles, the rectangles are colored if we have collapsed multiple layers into one to increase readability. Numbers outside of rectangles indicate the number of feature maps that go in and out of a layer. For example, the first layer expects three input channels and outputs 64 channels. Crop takes two tensors as an input, and crops the larger one of the two down to the size of the smaller one. The circled plus represents element-wise tensor addition. The down path is described in the next paragraph, and the up-path in the last paragraph of this appendix section.
A major point of reference for this architecture are the five max pooling layers with a kernel size of 2 × 2, a stride of 2, and no padding (colored red). These max pooling layers successively downsample the feature maps. The first two max pool layers are preceded by two convolutional layers each, while the last three are preceded by three convolutional layers. A convolutional layer consists of a convolution with a kernel of size 3 × 3, a stride of 1, and a padding of 1, followed by the activation ReLU function (colored green). The last max pool layer is continued by a convolutional layer with a kernel of size 7 × 7, a stride of 1, a padding of 1, and outputs 4096 feature channels. The output of this layer is then put through the ReLU activation function, dropout is then applied with a rate of 0.5 [161]. This is followed by a convolutional layer with a kernel of size 1 × 1, with a stride of 1, and a padding of 0, on which ReLU and 0.5 dropout rate is applied (colored blue). The down path ends with a convolution layer with a kernel of size 1 × 1, with a stride of 1 and a padding of 0, that outputs 21 features channels.
The 21 output channels of the last layer are upsampled by a deconvolutional layer with a kernel of size 4 × 4, a stride of 2, and a padding of 0. This is then cropped to fit the output size of the fourth max pool layer, before element-wise tensor addition is used to merge the two tensors. After that, a deconvolutional layer with the same hyperparameters as before is used to upsample the tensor again. The same operations are performed to merge this resulting tensor with the output of the third max pooling layer. Finally, this tensor is then upsampled by a deconvolutional layer with kernel size 16 × 16, stride 8, and padding 0. This final tensor is then cropped to fit the size of the input tensor.