Mapping with Pléiades—End-to-End Workflow

In this work, we introduce an end-to-end workflow for very high-resolution satellite-based mapping, building the basis for important 3D mapping products: (1) digital surface model, (2) digital terrain model, (3) normalized digital surface model and (4) ortho-rectified image mosaic. In particular, we describe all underlying principles for satellite-based 3D mapping and propose methods that extract these products from multi-view stereo satellite imagery. Our workflow is demonstrated for the Pléiades satellite constellation, however, the applied building blocks are more general and thus also applicable for different setups. Besides introducing the overall end-to-end workflow, we need also to tackle single building blocks: optimization of sensor models represented by rational polynomials, epipolar rectification, image matching, spatial point intersection, data fusion, digital terrain model derivation, ortho rectification and ortho mosaicing. For each of these steps, extensions to the state-of-the-art are proposed and discussed in detail. In addition, a novel approach for terrain model generation is introduced. The second aim of the study is a detailed assessment of the resulting output products. Thus, a variety of data sets showing different acquisition scenarios are gathered, allover comprising 24 Pléiades images. First, the accuracies of the 2D and 3D geo-location are analyzed. Second, surface and terrain models are evaluated, including a critical look on the underlying error metrics and discussing the differences of single stereo, tri-stereo and multi-view data sets. Overall, 3D accuracies in the range of 0 . 2 to 0 . 3 m in planimetry and 0 . 2 to 0 . 4 m in height are achieved w.r.t. ground control points. Retrieved surface models show normalized median absolute deviations around 0 . 9 m in comparison to reference LiDAR data. Multi-view stereo outperforms single stereo in terms of accuracy and completeness of the resulting surface models.


Introduction
For many applications in remote sensing, highly accurate and up-to-date mapping information gathered from very high resolution (VHR) satellite stereo images is needed. Mapping, in general, can be described as the process of generating 3D maps of a region of interest on the Earth's surface consisting of 3D coordinates linked with the spectral information that was observed by the underlying sensor. Exemplary applications in the domain of remote sensing are, for instance, city modeling [1][2][3], forest assessment and biomass estimation [4][5][6], change detection [7,8], land cover and land use classification [9,10], carbon reporting [11], farm land monitoring [12], glacier observation [13,14], disaster damage mapping [15,16] and mapping in general [17,18]. To allow semantic analysis those applications need mapping products in form of digital surface models (DSM), digital terrain models (DTM), their difference, that is, normalized digital surface models (nDSM) and the according multi-spectral ortho-rectified images. All this information can be extracted from stereo or multi-view satellite imagery. Therefore, this work focuses on the generation of those 3D mapping products on the example of images gathered from the Pléiades satellite constellation. Scientifically, this work is based upon our previous publications [19][20][21] which are extended, refined and completed. The central contributions of this work can be summarized as follows: • An end-to-end workflow for mapping with satellite data is defined, especially tailored to the Pléiades sensors. Each component of this workflow is reviewed in detail.

•
The specific implementation is described, where many insights are presented which often focus on efficient implementation yielding speed-up.

•
Many extensions of the classical stereo processing chain are explained, including epipolar rectification, stereo matching details, fusion methods and also assessment details. • A novel, robust and fast DSM to DTM conversion algorithm is presented that is very beneficial for various applications. • For the first time in remote sensing, a multiple view geometry data set is processed comprising six Pléiades images that form 15 stereo pairs in along and across track direction. Note that other multiple view combinations were already processed in, for example, References [22][23][24][25]. • A comprehensive assessment is performed based on 24 Pléiades images gathered with a variety of acquired scenarios and distributed over the globe. In particular, assessments of the accuracy of 2D and 3D geo-location, such as DSMs and DTMs, are performed. Additional, results on stereo, tri-stereo and multi-view data sets are analyzed in detail.

•
Directions of future research are pointed out, such as finding optimized matching cost functions or automatic ground control points (GCP) transfer from synthetic aperture radar (SAR) data to optical images which will then complement the whole workflow.
The special interest in using Pléiades data to generate those needed mapping products comes from the fact that this sensor can acquire tri-stereo panchromatic images in one single over flight (single pass along track stereo capacity) in high spatial resolution. In addition, several works show that it is possible to derive highly accurate DSMs from such stereo or tri-stereo images [19,21,26,27]. Note, that all presented and proposed methodologies in this work are not limited to this specific Pléiades satellites but can be directly applied to any satellite imagery with rational polynomial coefficient based sensor models. For instance, for Ikonos, Spot-7, Spot-8, WorldView-3, WorldView-4 and also the upcoming Pléiades-Neo constellation. The presented methods are all implemented within the commercial software package Remote Sensing Software Graz (RSG) (http://www.remotesensing.at/ en/remote-sensing-software.html) and thus are available for the remote sensing community.

The Pléiades Satellite System
The European Pléiades satellite system is a dual system comprising the two identical satellites Pléiades-1A and Pléiades-1B. They have been launched in December 2011 and in December 2012, respectively, both providing VHR image data (where VHR is defined for resolutions or ground sampling distances (GSD) below 1 m [28]). The satellites operate in the same orbit with an offset of 180 degrees to offer a daily revisit capacity. They also share the same orbit as the satellites Spot-6 and Spot-7 but are positioned 90 degrees phase shifted. They are supplied with remarkable agility, as the pointing angles can be triggered in a range of ±47 degrees (standard mode ±30 degrees). The sensors are capable of acquiring a panchromatic band (470-830 nm) with 0.7 m GSD at nadir and four multi-spectral bands (blue: 430-550 nm; green: 500-620 nm; red: 590-710 nm; near-infrared: 740-940 nm) with 2.8 m GSD. Images of the Pléiades sensor get delivered as a bundle of a panchromatic band upsampled to 0.5 m GSD and multi-spectral bands at 2.0 m GSD in GeoTIFF or JPEG format. The multi-spectral information can be pansharpenend [29,30] or, alternatively, a multi-spectral pansharpend product can directly be ordered. The swath width of Pléiades image data is 20 km on ground, that corresponds to 40,000 pixels in across track direction. Interestingly, the Pléiades open star cluster was the best place to steer the eponymous satellite in the calibration phase [31]. For additional information on the Pléiades sensors we refer to References [27,[32][33][34][35].
The satellite agility is of importance with respect to 3D mapping from Pléiades image data. Similar to the present VHR missions of Ikonos, Spot or WorldView, stereo data can be acquired during one overflight (single pass) through an appropriate forward and backward arrangement of the sensor. A significant innovation and advantage of Pléiades, however, is provided through the capability to acquire even three images for an area, taken from the same orbit at along track forward-, nadirand backward-view of the sensor and through the possibility of an across track swipe. Such image triplets are also denoted as tri-stereo data sets. An assessment of the benefit of such arrangements was made, for example, in Reference [36], with respect to image triplets composed from multi-sensor and multi-temporal acquisitions, respectively. In addition to that, the Pléiades sensors are also able to steer in across track direction such that they can collect images over the same scene on ground from different orbits yielding also across track stereo pairs. This enables the collection of multiple view geometry data sets (term taken from Reference [37]) as well known from computer vision and photogrammetry. The sensor model is delivered as a rational polynomial coefficient (RPC) model. The geo-location accuracy of this model is reported to be in the order of 8.5 m CE90 [32]. According to Reference [38] CE90 is defined as the radial error which 90% of all errors in a circular distribution will not exceed, that is, the 90th percentile of circular error distribution. Therefore, for accurate mapping, the geo-location accuracy has to be refined, that is, optimized using GCPs. Additionally, the launch of the Pléiades-Neo satellite constellation is planned between 2020 and 2022 [39]. It consists of four satellites with a panchromatic band at 0.3 m GSD, six spectral bands (deep blue, blue, green, red, red edge, near-infrared) at 1.2 m GSD, a swath width of 14 km, an envisaged geo-location accuracy of 5 m CE90 and a mission lifetime of more than 10 years.

End-to-End 3D Mapping
This work proposes a complete end-to-end workflow necessary for 3D mapping, where the main goal is to automatically and robustly derive the mapping products from the input data. The input data consists of (1) multi-view satellite images, that can consist of one or multiple stereo pairs, one or multiple triplets or arbitrary combinations like one along track triplet and a single image from another orbit and (2) related metadata, which are the sensor model for each image, reference GCPs and their 2D image coordinates in each given image. Generally, the image coordinates are manually measured within an image-based interactive measurement tool. This information is then employed in an automatic workflow that is comprised of four core tasks sketched schematically in Figure 1. Those tasks are sensor model adjustment, DSM generation, DTM generation and ortho-rectification. The final output of the proposed process are a DSM, a DTM, an nDSM and an ortho-rectified image mosaic, that is, the envisaged 3D mapping products. The flow graph for those tasks is depicted in Figure 2 with the following logic and objectives:

1.
As the initial sensor models are too inaccurate for precise mapping (cf. Reference [21]), those models have to be optimized in the first task. As depicted in the flow graph in Figure 2, the sensor models of each image are optimized separately in the parameter adjustment by employing the initial sensor model together with the GCPs and the image coordinates. The adjusted sensor models are later used in DSM generation and ortho-rectification. Within this task the geo-location accuracy is assessed in detail, that is, assessment of a-priori accuracy as well as assessment of the need and feasibility for optimization, typically by means of GCPs and least squares adjustment. For several test sites and acquisition scenarios, an investigation regarding 2D and 3D accuracy assessment is given. Optimization necessities and potentials are demonstrated in this context. 2.
The second task takes all images and their adjusted sensor models as input and derives one final DSM. First, an intermediate DSM is generated from each possible stereo pair. As shown in Figure 2, we propose the following production chain: Epipolar rectification based on the adjusted sensor models followed by stereo matching that yields a disparity map. Spatial point intersection

Related Work
The proposed end-to-end workflow for 3D mapping with VHR Pléiades satellite stereo imagery touches various scientific independent topics that are reviewed in this section. Due to the goal of end-to-end mapping, those topics are intertwined like pieces of a puzzle, together bringing the complete processing chain into being. The presented topics are multidisciplinary and touch the fields of remote sensing, photogrammetry, computer vision and applied mathematics. This section follows the structure as given in Figure 1, thus touching the four core tasks of model adjustment, DSM generation, DTM generation and ortho-rectification, including all required sub-tasks.

Sensor Modeling and Parameter Optimization
For Pléiades images, a set of rational polynomial coefficients (RPC) is given, which define the rational polynomial sensor model and thus substitute a physical sensor model. This generic solution dates back to Reference [40] and for some sensors replaced the physical model. Many sensor models of optical VHR satellites are delivered with RPCs, such as Cartosat-2, CBERS-2, Deomis-2, EROS B, GaoFen-2, GeoEye-1, Ikonos, IRS-P5, Jilin-1, Kompsat-2, Kompsat-3, Kompsat-3A, QuickBird, Pléiades, RapidEye, SkySat, Spot-6, Spot-7, Spot-8, SuperView-1, TeLEOS-1, TripleSat, WorldView-3 or WorldView 4. Also some SAR sensor models, like for instance ERS-1, are supplied with RPCs. The main advantage of using the RPC model is that it presents a general solution for sensor modeling such that novel sensors are automatically supported by a given implementation. The disadvantage is that the coefficients are not directly related to physical properties like, for instance, sensor position or viewing direction. Therefore, in the adjustment procedure (cf. Reference [41]) it is hard to interpret which effect caused the change of coefficients. Mathematically, a non-linear equation system has to be solved to adjust the sensor parameters. This can be accomplished with numerical methods, such as Newton's method, Gauss-Newton algorithm, gradient descent or Levenberg-Marquardt algorithm. Depending on the number of released parameters and starting values the adjustment delivers robust solutions or gets stuck in a local minimum.
There is also an ongoing discussion on which parameters to be optimized. For instance, only a shift is considered in References [42,43] while linear coefficients are also considered in References [21,40,44,45].
The need to manually measure GCPs and their related image coordinates represent the main bottleneck of mapping. Therefore, automatic methods for GCP definition and transfer are of utter importance. One idea is to employ SAR images, like from the TerraSAR-X sensor. This sensor provides extremely high geo-location accuracy [46] and thus perspectives to retrieve highly accurate 3D point information via radargrammetric processing [47][48][49]. Then such GCPs can be transferred into the optical Pléiades image via multi-modal image matching as presented, for instance, in References [50][51][52].

Digital Surface Model Generation
Accurate 3D reconstruction from stereo or multiple images is a well-known and well-researched topic in computer vision, photogrammetry and remote sensing. As depicted in Figure 1 the main steps are epipolar rectification, stereo matching, spatial point intersection, DSM resampling, DSM fusion and the related post-processing.

Epipolar Rectification
Epipolar rectification for push-broom images can be seen as a solved problem [53,54]. However, in contrast to perspective geometry, where analytic solutions exist for epipolar rectification (cf. References [55,56]), such rectification can only be approximated for push-broom images. The epipolar curves can be estimated by using the modified collinearity equations [57] and are hyperbola-like non-linear curves [58]. Some authors ignore the specific mathematical formulation of the push-broom collinear equations and treat the images as being perspective. These solutions work on a local level, that is, rectifying small parts of the images separately as shown in References [59,60]. In reality, it is sufficient if the location differences orthogonal to the epipolar direction (e.g., the column direction of the epipolar images) are below half a pixel [21]. Recently, it was also shown that an epipolar rectification with small residual errors can be achieved for high-resolution SAR satellite images [54]. Alternatively, epipolar rectification might be skipped if the subsequent stereo matching algorithm searches along the (non-linear) epipolar curves such as in Reference [61]. The drawbacks of this approach are that the image matching algorithm has to be general and has to consider the geometric constraints of the underlying sensor model. Additionally, for each location in the search space along the epipolar curve the local search window has to be resampled during matching.

Stereo Matching
Within the standard processing chain, the most challenging part is the correspondence problem, where in the dense case for each pixel coordinate in the first image the corresponding pixel coordinate in the second image which holds the same object has to be determined. The process of finding those parallaxes, shift vectors or disparities is called image matching. In principle, the matching problem determines the column and line shifts for each pixel and thus is based on a 2D search space. However, using the epipolar rectification for each pixel in the first image the corresponding pixel in the second image is located on a horizontal line. In this case, the search space in image matching becomes one-dimensional, thus enabling faster and more accurate matching algorithms. A very good benchmark of 1D matching methods can be found in Reference [62]. Having epipolar rectified images a standard procedure in photogrammetry and remote sensing is to apply the semi-global matching (SGM) introduced by Reference [61]. This matching scheme is able to achieve very good results, comparable to methods based on global optimization, for example, graph cut [63] or total variation [64]. Moreover, it is computationally very efficient. Currently many commercial packages for processing satellite data use epipolar rectification and SGM for DSM generation, for example, Erdas Imagine (http://www.hexagongeospatial.com/products/remote-sensing/erdas-imagine/overview), PCI Geomatica (http://www.pcigeomatics.com), Trimble Inpho with Match-T or MATCH-3DX (http://www.trimble.com/imaging/inpho.aspx), s2p (http://dev.ipol.im/~carlo/s2p/), SURE (http: //nframes.com/) (only works with perspective images [65]), MicMac (http://micmac.ensg.eu) [66], RPC Stereo Processor (http://u.osu.edu/qin.324/rsp/) [67] and also the Remote Sensing Software Graz (http://www.remotesensing.at/en/remote-sensing-software.html). Recent research reports on different variants of SGM, such as Reference [65] and extensions like those in References [68][69][70][71]. As underlying cost function most algorithms apply the Hamming distance of the Census transform [72] due to its robustness [73]. Alternatively, the so-called modified Census Transform [74,75] is used. Also a weighting can be incorporated based on the radiometric or the geometric distance (similarity and proximity weighting) [75,76]. Optimizing the stereo matching technique is always an important research topic. In particular, a combined cost-function of Census transform and absolute difference measure is envisaged but also the replacement of Census transform with generalized Census transform [77,78] is an issue. The most recent step is to train the cost-function via deep learning of a convolutional neural network [79][80][81][82].

Spatial Point Intersection
Using the sensor models and the disparities 3D points are extracted using spatial forward point intersection. For each matching result, there are four equations (two per image) to determine the 3D ground coordinates. The according non-linear equation system can be solved by linearization of the sensor model followed by a least-squares adjustment to find the coordinate increments. The spatial forward intersection leads to a 3D point cloud non-equally distributed in space [21].

DSM Resampling
There exist different paradigms, like storing the 3D information as point clouds [83,84] or as a DSM with given spacing. The first option is state-of-the-art in aerial photogrammetry while the second one is mostly used in satellite-based remote sensing. Additionally, in the second case, all information from a multi-stereo data set can simultaneously be used in the spatial point intersection to achieve a higher over-determination in the least-squares approach [36]. Another method is to extract DSMs from each stereo pair and then fuse those DSMs to a final DSM [85][86][87]. This method has the advantage that multiple stereo pairs can be processed independently in parallel (e.g., on multiple PCs), while only the fusion process has to access all data. There are several methods for mapping the 3D point cloud onto a DSM, for example, interpolation of a regular grid of height values and filling the remaining gaps by an appropriate interpolation mechanism [21] or by LiDAR-based approaches like that in Reference [88].

DSM Fusion
As mentioned above multiple DSMs stemming from individual stereo pairs (e.g., from a triple acquisition or neighboring stereo acquisitions) have to be fused into one final DSM. One option is to apply local approaches like mean, median or probability model-based filtering [85]. The other option is to state a global optimization problem to be solved numerically via total variation [87] or via gradient descent [86].

Post-Processing
Appropriate post-processing is always of interest, also at different stages of the presented workflow. For instance, occluded regions may be filled with an interpolation technique that filters towards the ground level. This avoids blurring of 3D breaklines while still gathering a high spatial reconstruction coverage. A post-processing option can also be seen in the so-called peak filter that segments images into radiometrically connected regions [61,89]. Small or isolated regions can then be discarded.

Additional Notes
In general, there exist workflows that can extract DSMs from Pléiades stereo or tri-stereo data. However, it is difficult to evaluate various approaches and to find the best solution for different applications. One good example is the Pléiades workshop held in Paris in November 2014 named DEM computation from satellite images: Existing tools and developments where different solutions were benchmarked. The main challenge remains in finding the optimal parameters for image matching (like a custom-tailored cost function, appropriate occlusion detection and handling principle), while being computationally efficient (this includes memory optimizations, parallel processing on a multi-core central processing unit (CPU) and exhaustive usage of a graphical processing unit (GPU)). From an implementation point of view it would be beneficial to have an epipolar rectification method that can be applied for push-broom optical and SAR images.

Digital Terrain Model Generation
In the literature, DTM generation approaches are described for different kind of input data. Most of them are designated for airborne LiDAR data, some for DSMs extracted from high-resolution airborne optical images and spaceborne optical imagery. In this work, we aim for a DTM extraction approach custom-tailored to DSMs that are generated by a state-of-the-art photogrammetric workflow from VHR satellite stereo or tri-stereo images, as presented in Section 2.2. Thus, a novel algorithm will be proposed within this paper which is especially suitable for this case.

Airborne LiDAR Data
There are many publications on DTM generation from LiDAR data, where the first and last pulse data exist. Good review papers are [90,91]. In regions of vegetation, the first pulse measurement corresponds to the top height of the vegetation while the last pulse is reflected from bare-earth, thus defining the terrain height. Therefore, shrubs, trees and whole forests can be easily filtered based upon LiDAR full-pulse data. Man-made objects like buildings, however, are present in both LiDAR measurements and still have to be filtered by appropriate algorithms (or even manually) to get a correct DTM. Since LiDAR provides very accurate height measurements, 3D breaklines are in general very well preserved in LiDAR DSMs. When moving along a 1D profile from street level to a building a sharp steep height jump is observed, which can be used as a distinct feature for the DTM generation process (cf. Reference [92]). One representative algorithm is described in Reference [93], which will serve as the baseline for the algorithmic evolution presented in this paper. A method on hierarchical robust linear prediction employing LiDAR data is introduced in Reference [94].

Airborne Optical Data
For very high resolution airborne cameras the work [95] reports on a method developed for UltraCam images [96,97]. They perform a classification of the ortho images into 15 classes, both using texture and height information. Then the classification is employed within the DTM generation process. The main difference to spaceborne data is that the underlying DSMs are of significantly higher resolution and of higher accuracy such that image classification works very robust-which would not be the case for spaceborne data. A variational approach which uses a DSM as single input is presented in Reference [98]. The basic concept is to extract a very smooth surface using a strong regularization weight within a variational formulation. This over-smoothed surface is used to determine potential points the on ground (called detection mask) which are then used to interpolate a DTM by employing the same variational formulation as before. However, problems occur on large buildings that are detected as belonging to the ground. Thus, the authors proposed to add a segmentation based on maximally stable extremal regions and a manual interaction step.

Spaceborne Optical Data
For satellite-based stereo DSMs, three DTM generation approaches based on morphology, geodesic dilation and steep edge detection on simulated synthetic urban scenes are evaluated in Reference [99]. A standard method based on morphological opening was extended in Reference [100]. There, percentile filters rather than morphological grayvalue erosion and dilation are employed to be more robust against DSM outliers. Another approach is based on determining the normalized volume above ground over several scanlines [101]. An automated stratified object-based approach that also uses the spectral information of an ortho image is presented in Reference [102]. An approach based on iterative geodesic reconstruction tested on Cartosat-1 stereo images is presented in Reference [103]. DTM generation in forested regions based on Cartosat-1 stereo image is reported in Reference [104]. There, the panchromatic image is classified into semantic classes, like buildings, low forest, high forest or ground and different filtering is applied to DSM regions according to the class.

Ortho-Rectification and Mosaicing
The process of ortho-rectification is well described in literature. The main two approaches are direct and indirect rectification. The first projects every input pixel into the resulting map product by an image-to-map transformation. The second starts from the map and projects every point back into the input image by a map-to-image transformation. The process of mosaicing, that is, stitching multiple individual ortho images to one product is understood as well. Design issues are the selection of pixels for one map cell and the feathering option. For both processes we refer to the standard literature [105][106][107][108][109]. A remaining challenge can be seen in an efficient integration into the complete workflow, which will be discussed in his work.

Test Sites and Data Sets
To allow sophisticated statements on Pléiades-based mapping, several test sites have been selected. They represent different acquisition dispositions in terms of viewing angles, stereo, triplet and multi-stereo constellations, cover different types of terrain and originate from both Pléiades-1A and 1B sensors. The test sites are located in Austria, Italy, Slovenia, Singapore and China. For the core investigations, three test sites were used, for which both representative Pléiades acquisitions and appropriate reference data sets are available. The data from Singapore must not be published, thus no screenshots of this test site are presented in this work. Overall, 24 Pléiades images are used in this study, with acquisition dates and main acquisition parameters as listed in Table 1. Images are also assigned with an ID such that referring to a specific image gets simpler. When talking about a stereo model it is addressed as two IDs, like model 1-2.
For assessment of the 2D and 3D geo-location accuracy reference GCPs are needed. In the optimal case, these points are measured with in-situ highly accurate devices, such as differential GPS. This was done for the test site Singapore. Alternatively, existing orthophotos with higher spatial resolution can be used to manually define the GCPs where the height information is taken from the LiDAR reference. This was done for the test site Ljubljana. In other cases, airborne LiDAR information is used to derive GCPs. There, distinctive locations, especially at 3D breaklines, are manually measured (for instance, a corner of a roof top or a hydrant). Obviously, as the LiDAR DSMs are available with 1 m GSD those manually measured objects will never get the accuracy as the in-situ or orthophoto measurements. This has to be kept in mind when comparing the accuracy values, as the accuracy of the reference information is the lower bound for the resulting geo-location accuracy.
The multitude of test sites were chosen to demonstrate that the proposed end-to-end workflow allows precise 3D mapping over the whole globe and is able to operate with different reference data and multi-stereo dispositions.
Since the incidence angles of Pléiades images are given for across track (roll, ω), for along track (pitch, φ) and for the combined overall angle, we derive equations to calculate the stereo intersection angle δ (also called the convergence angle [42]) between two Pléiades scenes and accordingly an approximation of their base-to-height (B/H) ratio. Thus, we first define the two vectors r 1 and r 2 and second they are used to compute δ and B/H as δ = arccos r 1 · r 2 |r 1 | · |r 2 | and All convergence angles within this work are determined using Equation (1), especially the ones given in Tables 1 and 2.

Test Site Innsbruck (Austria)
This test site covers urban, rural and mountainous terrain, with ellipsoidal heights ranging from 560 m to 2750 m and spans over an area of 1200 km 2 . In July 2013, only one stereo pair was acquired from Pléiades-1A after the launch of Pléiades-1B and thus after the recalibration of the Pléiades-1A satellite. One year after, within two days in June 2014 three image triplets were achieved, two of them acquired by Pléiades-1A and one by Pléiades-1B , which show different geographic areas of this test site, that is, eastern, center and western part. The central part of the test site is shown in Figure 3. The triplets do not have an overlap on the ground, such that images from different triplets do not form across track stereo pairs. For Innsbruck, an airborne LiDAR DSM is available as reference data set, collected in 2006 at 2.0 points/m 2 . The temporal gap between Pléiades and LiDAR data has to be taken into account in the evaluation since changes occurred due to urban build-up and vegetation grow or clear-cut. The LiDAR DSM is available as raster data in UTM 32 North map projection and WGS 84 datum at 1 m GSD with ellipsoidal heights and could be used further to measure GCPs. 3D discontinuities in the LiDAR data can be used to define and manually measure ground coordinates. A fairly high number of GCPs, homogeneously distributed in planimetry and height, is measured. Targets like road intersections, corners of houses, water bodies or field boundaries, serve as control point candidates. Thus, 30 points were acquired for Innsbruck.

Test Site Trento (Italy)
This test site covers rural as well as mountainous terrain, the ellipsoidal heights ranging from 175 m to 1550 m and spans over an area of 220 km 2 . A Pléiades-1A triplet was acquired in September 2012, which however is far from optimal for 3D reconstruction, since the first two images have a very small intersection angle of only 5.5 • , while the second stereo pair has a huge intersection angle of 27.3 • . The test site is shown in Figure 4. Also for Trento an airborne LiDAR DSM is available as reference data set, collected in 2007 at 1.3 points/m 2 , available as raster data in UTM 32 North map projection and WGS 84 datum at 1 m GSD with ellipsoidal heights. Again there is a huge temporal gap between Pléiades and LiDAR acquisition to be taken into consideration. 21 GCPs were measured from LiDAR and serve as reference points.

Test Site Ljubljana (Slovenia)
This data set consists of three Pléiades stereo acquisitions from adjacent orbits over the region north to Ljubljana, Slovenia, first presented in Reference [5]. The stereo sets were acquired within three days in July 2013, the first and third set with the Pléiades-1A platform and the second with Pléiades-1B. The ellipsoidal terrain height of the region of interest ranges from 390 m to 1950 m and the scene covers about 400 km 2 consisting of agricultural land, managed forest, villages, the airport Brnik and mountainous areas. The test site is shown in Figure 5. Reference data exists in terms of 18 GCPs and 12 ICPs measured in high resolution orthophotos, LiDAR reference DSM, LiDAR reference DTM and image coordinate measurements of the GCPs and ICPs. LiDAR reference data was taken in 2015 with a mean density of 14 points/m 2 over a region of 345 km 2 . The LiDAR DSM and DTM with a GSD of 1 m were derived using OPALS [88]. This data set holds a temporal gap between LiDAR and Pléiades acquisitions of only two years. Thus, analysis can be performed on large regions as the structural changes are smaller than for the other data sets. For any other detail on the dataset we refer to Reference [5].
The acquisition scenario is depicted in Figure 6 and is of particular interest. In contrast to classical stereo or tri-stereo along track acquisitions, here also across track stereo pairs can be employed for 3D mapping, which is not the case for the other test sites and which was never done in literature so far. Regarding the multiple view geometry concept our data set holds more than the three along tracks stereo pairs from descending orbit, namely 15 stereo pairs. Table 2

Test Site Singapore (Singapore)
This test site covers a part of the island of Singapore, thus mainly consisting of a dense urban environment with very tall buildings. The height is ranging from sea level to about 200 m and the scene covers about 275 km 2 . For Singapore terrestrially measured GCPs from Singapore Land Authority and the Tropical Marine Science Institute are available, which are used for sensor model validation and adjustment. However, due to restrictions the GCP heights are rounded to integer values and thus degrade the advantage of highly accurate in-situ measurements in this context. Of the whole set of control points, 16 GCPs are located within the specific stereo scene. Available LiDAR data is under restriction as well and cannot be used within this study.

Test Site Tian Shui (China)
The study area is located in central China close to the city of Tian Shui. Due to its location, it shows a geomorphological position in the landscape since it belongs to the Tibetan Plateau and also to the southern edge of Long-Zhong loess hilly area [110]. The ellipsoidal heights are ranging from 1200 m to 2100 m and the scene covers 395 km 2 . Since for this site no reference data is available it more or less serves for visual interpretation purposes only. To allow accurate relative orientation tie-points are manually measured and applied within the sensor model adjustment. The test site is shown in Figure 7.

Methods
The envisaged end-to-end mapping workflow comprises four core tasks to be efficiently accomplished (cf. Figure 1). Thus, in this section we discuss our algorithmic approaches beyond the state-of-the-art with respect to (1) sensor modeling and parameter optimization, (2) generation of DSMs from multi-stereo Pléiades images, (3) generation of DTMs using the extracted DSMs and (4) ortho-rectification of the multi-spectral image data. In addition, evaluation protocols are presented that are later employed in the assessment in Section 5. The structure of this section follows the flow graph Figure 2 and thus the related work presented in Section 2.

Sensor Modeling, Parameter Optimization and Geo-Location Accuracy Assessment
The RPC sensor model, as defined in References [40,41], relates geographic coordinates to corresponding image coordinates (i.e., a map-to-image transformation) by two rational cubic polynomials. The RPC model consists of 80 coefficients and approximates the physical sensor model in a standardized manner. For optimizing an RPC sensor model GCPs and their related image coordinates are employed. The optimization can be formulated in object-space or in image-space [41]. We pursue an adjustment in object-space since then the solution is also dependent on the height. This would not be the case if, for instance, an affine transformation is adjusted in image-space [40,42]. From the mathematical perspective, the parameter optimization can be formulated as a non-linear equation system. Like in Reference [41], we use Newton's method to solve the equation system and thus to optimize the sensor model.
To assess the geo-location accuracies, we follow the protocol that we defined in Reference [48]. The 2D geo-location accuracy of initial and optimized sensor models is evaluated for each images using reference GCPs and also their 2D locations in this image (cf. Figure 2). The 3D GCPs are projected onto the 2D image using the RPC model and the deviations to the reference 2D image coordinates define the model accuracy. Those differences are described using statistical measures such as mean, standard deviation (STD) or root mean square (RMS) errors in pixels. We give those values for across and along track direction such as the absolute length of the differences. Often those statistics are given for GCPs and ICPs.
For stereo acquisitions also the 3D geo-location accuracy can be determined. In this case, a spatial forward point intersection is performed based on the 2D image coordinates and the sensor models yielding 3D points. Those points are compared with the reference GCPs and ICPs resulting in statistics on 3D locations errors in meters (assuming that the underlying map projection is metric). Here the statistical values are given for East, North and Height direction such as for the absolute Euclidean distance (also called length).

Digital Surface Model Generation
As illustrated in Figures 1 and 2 the workflow for DSM generation comprises several steps that have to be applied sequentially and are described in detail in this section.

Epipolar Rectification
For epipolar rectification of the Pléiades push-broom images our approach in Reference [54] is modified that initially was designed for SAR images. However, we show that it can also be applied to push-broom geometries by supporting the RPC sensor model within the framework. This aspect highlights this method since a single implementation is applicable to multiple geometries. For details of the rectification method we refer to Reference [54]. Since, the underlying transformations from image to epipolar rectified geometry are non-linear and also incorporate iterative image-to-map transforms (e.g., solved via Newton's method and linearization), pixel-by-pixel evaluation is computationally expensive. Especially, as the epipolar Pléiades images are huge, ca. 50,000 × 50,000 pixels for a full scene, an approximation is desired. It can be observed that the transformation is a locally slowly changing continuous function. Thus, we propose an algorithmic extension by a rigorous evaluation on grid points only. Transformation results between the grid points are then achieved by bilinear interpolation. We set the grid size such that the maximal pixel shift in column and line direction in comparison to the rigorous transformation is below 1/1000 of a pixel and thus negligible. For the test site Ljubljana, this accuracy is achieved when using a grid size of 250 pixels. We reduce it to 100 pixels to ensure errors below 1/1000 of a pixel also for any other possible scene, thus reducing the transformation effort by a factor of 10,000. Since the epipolar rectification maps an image from the original resolution to the same resolution, the GSD does not influence the mapping accuracy and thus the grid size does not need to be adjusted w.r.t. the GSD. Actually, the error limit of 1/1000 of a pixel can be seen as no error as we are working on the image domain. We could also set this limit to 1/100 of a pixel which then corresponds to a grid size of 700 pixels and thus a speedup of 500,000. To avoid recalculation of those grids they are also stored to be used later in the spatial point intersection. Those grids are called forward and backward addresses. Figure 8 depicts the geometric relation between stereo input images, their epipolar versions, forward and backward addresses and both disparity maps gathered in the matching step (cf. Section 4.2.2). For instance, a pixel coordinate from image 1 given in Figure 8 can be transformed into the epipolar version of image 1 by using the forward addresses of this image. Overall, the proposed approximation results in a significant speedup, thus extending the approach in Reference [54].

Image Matching
As the image matching is employed on epipolar images, the disparities have to be searched along one dimension only, that is, the column direction. The classical matching paradigm is depicted in Figure 9 and consists of three main steps: disparity space image calculation (i.e., creation of the 3D cost volume), cost aggregation (i.e., an optimization by incorporating the neighborhood of a pixel) and disparity section (i.e., selection of the final disparity per pixel from the aggregated cost volume).
The current state-of-the-art in cost volume calculation is the normalized Hamming distance of the Census transform [72], as it is invariant to non-linear radiometric distortions and very efficient in its calculation. Since it compares all pixels within a patch to the central pixel value the transform becomes ambiguous when the patch holds homogeneous, that is, textureless, values. Therefore, the modified Census can be used that compares the pixels to the mean value of the patch [75,111]. We propose another option that works well for satellite-based images, which is to stabilize the Census transform by combination with a sum of absolute differences (SAD) cost function. SAD is normalized by the bit depth of the images, that is, 2 16 for Pléiades. Normalized cross-correlation yields more blurring at 3D breaklines than Census transform such that it is not applied (cf. Reference [73]). It is also important to note that within this work the matching only uses the panchromatic band of the image. Obviously, the pansharpened multi-spectral information could be used as well. In this case, the combined cost-function is defined as the weighted sum of the costs for each spectral band. However, previous experiments showed that color information is not helping much but slows down the calculation of the disparity image significantly [112]. The influence of the patch size on the performance in terms of runtime and accuracy is evaluated in Reference [73], where a 9 × 9 pixel patch performs best. In literature often 9 × 7 pixel patches are used as those 63 bits can be efficiently stored in a 64-bit register (cf. Reference [113]). After calculation of the cost volume the cost aggregation is performed with semi-global matching (SGM) introduced in Reference [61]. It combines the efficiency of a local method with an accuracy comparable to global methods. This is achieved by approximating a 2D Markov random field (MRF) optimization with several 1D scanline optimizations (cf. Reference [71]). Those scanline optimizations can be efficiently solved via dynamic programming yielding low runtimes. It has been shown that SGM is a special case of message passing algorithms like belief propagation and tree-reweighted message passing [69]. Overall, SGM approximates the energy minimization of a 2D Markov random field with C p (d p ) being a unary data term that represents the matching cost of a matching pixel p at disparity is a pairwise smoothness term that penalizes disparity differences between neighboring pixels in the neighborhood N . In SGM, V implements a first-order smoothness constraint as with the penalty values P 1 and P 2 . They penalize disparity jumps of 1 pixel and of multiple pixels. Now instead of minimizing the MRF, which is NP-hard, SGM minimizes a 1D version of Equation (2) along 8 cardinal directions r via dynamic programming [61]. For each direction r an aggregated matching cost L r (p, d) is recursively calculated as Then, the 8 aggregated costs are summed at each pixel, yielding the aggregated cost volume where the minimum at each pixel is chosen as the final disparity Since 3D breaklines are often visible in the input images as local changes of the grayvalues, the penalty P 2 can be adjusted w.r.t. these local changes of the reference input image I in the current direction r [61]. As in the original paper no details on the adaptive penalty function are given, we propose to define P 2 as a function of I, r and three parameters , τ and P 2 as with µ I the mean gradient in column and line direction of I, q the adjacent pixel coordinate of p based on r, P 2 the initial penalty value and some to avoid a division by zero if I p equals I q . The parameter is set to = min(|I p − I q |) | ∀|I p − I q | = 0 , that is, becomes 1 for most images of type uint8 or uint16. The equation also ensures that P 2 is always larger than P 1 by at least a predefined τ. We set τ in a pragmatic way to τ = P 1 + (P 2 − P 1 )/100 such that it automatically adapts to the input penalty values. This extension results in twofold improvements: (1) in homogeneous regions the penalty P 2 becomes huge such that large (mostly incorrect) disparity jumps are omitted and (2) at pixel locations with large local grayvalues changes P 2 become smaller and thus 3D jumps are cheaper for the optimization. Since 3D breaklines often correspond to grayvalue changes such jumps are better modeled by the algorithm.
The resulting disparities d p are calculated at integer precision. To get subpixel refinements the classical approach is to fit a polynomial of order 2, that is, a parabola, into the three values surrounding the best-estimated disparity. The minimum of this polynomial, which can be extracted analytically, then defines the subpixel disparity.
For outlier detection and interpolation, the following options are implemented. A classical left-right consistency check is performed by employing backmatching. Disparities with a backmatching distance larger than a given threshold are marked as outliers. The standard threshold value is just above √ 2 such that a matching is correct if the backmatching hits within the local 3 × 3 window (e.g., 1.5 pixel). Then also a threshold on the aggregated cost is performed. Lastly, a peak filter detects connected regions (cf. Reference [61]), where small islands are discarded. In the next step, discarded pixels are classified as blunder or as occlusions as described in Reference [61]. Then, blunder locations are linearly interpolated based on their neighbors, while occluded locations are interpolated toward the ground level to avoid blurring at 3D breaklines.
Since the Pléiades images are huge, we propose an algorithmic extension mainly resulting in speed-up as follows. While in Reference [61] a hierarchical image pyramid based approach is used to determine the disparity search range for the final level (i.e., the original image), we apply the paradigm first presented in Reference [114] to limit the search range for each pyramid level (visualized in Figure 10). Instead of operating on the complete cost volume we propose to only use a band of this volume for disparity search. This change results in a significant speed-up, especially for scenes with large height variation, since we only have to search for a given number of disparities in the last level (e.g., 41 pixels instead of 2000 pixels, which is the average disparity range of the Ljubljana data set). Depending on the topography of the scene this cost volume (and also the aggregated cost volume) can be extremely large. For instance, on the Ljubljana test set this volume has a size of about 20 terabytes in 32 bit float encoding. With the proposed extension it shrinks to 2% of the original size, which is still around 400 gigabytes. We call this method truncated SGM. As depicted in Figure 10 it might be the case that for a disparity d the adjacent disparity d is not calculated thus does not exist. Therefore, we propose to modify the smoothness term given in Equation (3) to (8) Figure 10. Cost volume C p (d) of classical SGM versus our implementation is shown on a toy example. While SGM evaluates the whole cost volume, we only check the disparities visualized in light blue color around the disparity predictions from the previous pyramid level depicted in dark blue color. In this toy example the search space is reduced to 44%, however, the reduction is larger in real satellite images.
Depending on the 3D structure of the scene maximal memory consumption reduction between 65% to 99% is achieved while the runtime is reduced by 30% to 98% (also cf. Reference [65]). In addition, instead of storing the cost volume and the aggregate cost volume as 32-bit float values, we discretize them to 16-bit unsigned integers (uint16). This is done by introducing a scaling factor and yields another 50% memory reduction. This discretization obviously affects the results. However, the rationale is that the discretization changes the original values only a little bit (some fractions as the initial costs are in the range [0, 1]). Thus, if those small changes influence the argument of the minimum of the aggregated costs given in Equation (6), then those disparity values are very likely incorrect anyway. Such regions are later detected in blunder and occlusion classification and removed. Overall, the discretization has a non-significant influence on the final disparity map. Next, we observe that due to the prior epipolar rectification a significant region on the border of the input images consists of nodata values. In cases where all disparities of investigation only contain such invalid regions, the SGM optimization could be skipped. Therefore, we introduce a specific uint16 value for the disparity space image that indicates those regions. The runtime of the SGM optimization can then be reduced by around 33% (mean value over our test images).
To gain speedup and accuracy, we incorporate a coarse DSM (e.g., SRTM DEM [115], ASTER GDEM [116], World DEM [117]) on the top pyramid level to calculate disparity predictions and thus to limit the search space and reduce ambiguities. For a grid of pixels in the reference image, an image-to-map transformation is calculated using the coarse DSM. The resulting map coordinates are then projected into the search image with a map-to-image transformation. The couple of 2D image coordinates is then transformed into the epipolar images using the forward addresses (cf. Figure 8). The column difference of the epipolar pixel coordinates then serves as prediction for image matching. With such disparity prediction the search space can be significantly reduced at the top level.
Like others, we found that the simple parabola subpixel interpolation yields a bias in the distribution of fractional disparities (cf. Reference [118]). This bias causes incorrect reconstructions of tilted surfaces. To reverse this effect we calculate a mapping between the current fractional distribution and a uniform distribution. It is then applied efficiently via a lookup table. This effect is depicted in Figure 11.

Spatial Point Intersection
From each valid matching result (i.e., disparity) a 3D point can be calculated by spatial point intersection, finally yielding a 3D point cloud irregularly distributed in space. In the first step, for each pixel in the disparity map, the locations in the reference and search input images have to be determined. As indicated in Figure 8 the backward address of the reference image and the backward address of the search image are employed in this step. Then, the spatial point intersection can be performed via an iterative least squares adjustment. The resulting point cloud can be stored in LAS or LAZ format [119] or in a multi-band image. In this paper, the second option is used, where a three-band image of the size of the disparity map then stores East, North and height values (cf. Figure 12a and [84]).

DSM Resampling
Since in remote sensing a classical data representation is still the 2.5D DSM the 3D point cloud has to be mapped to a 2D image with equidistant spacing (i.e., a regular grid). This process is also called resampling or regridding. We present two simple approaches to doing so. First, the dimension and spacing of the output DSM are defined. Then, we use either nearest neighbor interpolation or areal interpolation in the regridding step. In the former case, each 3D point is projected into the DSM and put into the nearest integer position. If this value was already assigned then the maximum between the previous and the current value is chosen. In this way, the height value closest to the sensor is reconstructed. In areal interpolation, four adjacent 3D points are projected into the DSM. Then, all integer locations that are covered by this quadrangle are linearly weighted interpolated. Double assigned pixels are treated as described above. Overall, this procedure is a direct resampling that potentially is not hitting all pixels of the DSM grid. However, as multiple DSMs get fused in further processing, this simple strategy is sufficient and in reality the nearest neighbor approach is usually used due to performance benefit. The DSM resampling is depicted in Figure 12.

DSM Fusion
In the proposed processing chain, the DSM fusion is applied for two main reasons: (1) to combine the information of forward and backward matching in cases when only a single stereo input set exists and (2) to combine multiple forward and backward matchings in cases when multiple input (stereo) sets exist. Thus, the fusion process gets multiple DSMs as input and calculates one final DSM. For simplicity, we stick to a local method [85] which takes all height measurements within a 3 × 3 pixel neighborhood and extracts the probability mode of this probabilistic height distribution. The only required parameter is a threshold value of how many input height values have to exist such that an output is generated. To get high coverage this value is set to 1 by default. We extend this method for arbitrary neighborhoods and also incorporate a step size. This step size can be used to generate a lower resolution DSM by, for instance, only extracting the probability mode at every second pixel in column and line direction. In contrast to a straightforward median-based approach, local height errors can better be eliminated using the probability mode [85].

DSM Assessment
In remote sensing, it is common sense to compare two DSMs based on their differences. This leads to a distribution defined as DSM ∆ = DSM ref − DSM current , where in our case the reference is a LiDAR DSM and the current DSM the one retrieved from Pléiades image data. For judging the quality of the differences the statistical values mean and standard deviation are used. However, this only makes sense if the underlying distribution is a normal distribution. As observed by researchers this is actually not the case (cf. References [42,[120][121][122]). First, both DSMs include strong outliers such that the distribution of differences contains heavy tails. Second, the distribution itself is often non-Gaussian as, for instance, gaps between buildings are not reconstructed from satellite data causing a systematic bias next to the main lobe. In addition, the mean and standard deviation measures are non-robust estimates which suffer from both aspects. Therefore, in this case, robust estimates have to be used; in particular, the median (MED) value for the center of our normal distribution and the normalized median absolute deviation (NMAD) for the standard deviation. The NMAD is defined as as NMAD(∆H) = 1.4826 · median(|∆H − median ∆H|) . Figure 13 depicts a real example of this study where only errors between −10 m and +10 m are plotted. In particular, we show the distributions of height differences of a stereo DSM from test site Ljubljana (blue) with the classical normal distribution fits based on mean and standard deviation (red) and the robust fits based on median and nmAD (green). From this plot, it is clearly visible that the standard deviation of the classical fit is far too large which is a result of the non-robust estimate due to outliers. The NMAD based normal distribution is narrower and fits the input data better. The estimated central value of the default fit is drawn to the negative side. This obviously happens with the robust fit as well, however, less distinctive. This aspect has to be kept in mind when interpreting height differences and thus, will later be discussed in Section 5.

Digital Terrain Model Generation
To extract a terrain model from a surface model, man-made structures and vegetation have to be removed from the given input DSM. As depicted in Figure 1, the workflow for DTM generation comprises two main steps: determination of bare-earth points and interpolation of all other points and according post-processing. As no state-of-the-art method was found that delivered satisfying DTMs at fast runtimes, a novel method named multi-directional slope dependent (MSD) DTM generation is presented in this section. Our Matlab source code is available at GitHub: https://github.com/ rolandperko/dsm2dtm.

MSD DTM Generation Approach
The presented DTM extraction approach is especially suited for DSMs that are generated by a state-of-the-art photogrammetric workflow from VHR satellite stereo or tri-stereo images. One constraint of such DSMs is that 3D breaklines are not always clearly defined, which can be traced back to occluded areas that cannot be reconstructed. Our idea is to extend the algorithm of Reference [93], which is designed for LiDAR data, to be slope dependent and to be multi-directional, that is, 8-directional to span the complete 2D image space. Nonetheless, the focus is put onto simplicity, robustness and computational efficiency to follow integration needs into the automatic end-to-end workflow:

•
Determine points in the DSM that are located in bare-earth regions.

•
Apply some post-processing on the ground mask and remove all other non-bare-earth regions.

•
Fill the resulting holes using DTM interpolation.

•
Apply some post-processing.
Obviously, the crucial and most difficult step is the first one. Based on our literature study the most promising algorithm to start with is the directional filtering method by References [91,93]. A short recap on directional filtering according to Reference [93] can be given as follows: The basic idea is to process each line of the given DSM separately, for example, from left to right with a sliding window of a given extension. First, the minimal value in this window is determined, which is considered to represent the bare-earth terrain at the minimal position. In this step, it becomes obvious that an object to be filtered has to be smaller than the filter extent. Then, if the current pixel under examination has a large difference to the minimal value w.r.t. a given height threshold, it is considered as a non-ground point. If this is not the case and the slope between the current pixel and the next one in scanline direction is larger than a given slope threshold, the pixel is also considered as a non-ground pixel. If the slope is positive and smaller than this threshold the pixel gets the same label as the previous pixel. If the slope is negative, then the distance to the closest ground point is used to decide whether the pixel is classified as ground or as non-ground. This method suffers from two drawbacks:

1.
A bottleneck is inherent to the negligence of the local slope of the underlying terrain, such that no useful results can be expected on tilted surfaces or in mountainous areas.

2.
Since it works on 1D image profiles, hereby denoted as scanlines, it yields fast runtimes. However, when applying single scanline processing only, the complete 2D context into which an object is embedded is lost. In addition, the given DSM can have unsharp 3D edges in some scanline directions.
Using these simple insights, two main extensions to the directional filtering concept are proposed for the development and implementation of our novel MSD DTM generation method:

1.
Incorporation of the local terrain slope.

2.
Extension of the local horizontal scanline approach to multiple scanlines, which spans the full 2D image space.
The main reason for the simplicity of the MSD algorithm is the fact that the robust fitting is replaced by a smoothing done beforehand. Now the terrain slope fit is directly gathered using the difference value of this smoothed input DSM. After removing the non-ground points, the resulting holes are filled employing a triangulation based linear interpolation. The algorithm uses three parameters, that is, the filter extent in meters, the height threshold in meters and the slope threshold in degrees.

Consideration of Local Slope
The main drawback in References [91,93] is that the local slope of the terrain is not considered in the processing. If this algorithm is used to process regions which are not flat, incorrect filtering occurs since the difference of the current height to the estimated minimal value depends on the terrain slope. If ignored the following height thresholding also removes bare-earth regions. This issue is sketched in Figure 14, where a 1D profile of an artificial DSM to be filtered is given in (a). It shows a tilted surface and a building with some noise. The potential filter extent is visualized as the blue dashed line. The minimal height value within the filter extent is marked as black dot. It is always dragged towards the terrain fall and thus is incorrect (in the example a height of 93.9 m). If we apply a robust terrain slope fitting, the green dashed line in (b) is received. Using that tilt, the initial profile could be slope corrected (c) and the new minimal value can now be correctly extracted (98.7 m in this example).
In addition to such an incorrect minimal value, the slope estimate of two adjacent DSM values should be corrected w.r.t. the local terrain slope as well.
The first algorithmic extension is very intuitive and sketched in Figure 14. In the workflow, a robust fit of the height values within the filter extent must be performed. An obvious solution would be to use an iterative weighted least squares method with bisquare (Tukey) or Huber weighting function (cf. Reference [123]). However, this robust fit has to be performed for each pixel and for all scanline directions and thus would slow down the process. Therefore, we propose a heuristic which yields similar results as the robust fit but can be calculated very efficiently. The idea is to smooth the DSM using a huge Gaussian kernel. Local objects like buildings disappear while the terrain slope remains in this smoothed surface. The slope heuristic in the desired direction can then directly be calculated as the pixel difference of two adjacent pixels. Therefore, a 2D Gaussian smoothing is implemented with a spatial sigma σ and a kernel size of n × n m 2 . Per default those parameters are set to σ = 25 m and n = 101 m. The (separable) 2D convolution is implemented as two 1D convolutions for speedup. The pixel difference of the central (smoothed) pixels then defines the local terrain slope value. On the Innsbruck East triplet data set the differences of the robust and heuristic slope estimates have a mean value of 0.05 • with a standard deviation of 1.76 • . These differences are small and acceptable, especially taking the speedup into account. Figure 15 shows a real example for the downward scanline. Given are a subset of the DSM and the local slope estimate as achieved by smoothing. The plot shows the original 1D DSM profile (black), the robust fit of the terrain slope (green), the proposed simple fit of the terrain slope (cyan) and the robust and simply corrected data (red and blue). It is obvious that the simple and fast Gaussian-based approximation yields very similar results in comparison to the iterative bisquare weighted least squares solution. Actually, initial tests revealed that it is faster by a factor of at least four magnitudes for a filter size of 91 pixels (i.e., a speedup factor of at least 10,000).

Multi-Directional Filtering Approach
The second algorithmic extension concerns the data processing scheme in terms of scanlines. As stated above, consideration of only one scanline direction from left to right yields a local solution and the filtering result could abruptly change within two adjacent pixels from two neighboring scanlines. Instead, we propose to use 8 scanlines and fuse the results for final pixel classification. The principal concept to solve a problem in 2D by fusing multiple 1D solutions is based on the work of Reference [61]. In the presented case the fusion is simply based on majority voting. If more than 5 scanlines classify the current point as a ground point, this point is classified as ground point, else as non-ground point. This aspect is very important for satellite stereo DSMs, as some scanlines may classify a non-ground point incorrectly (e.g., due to a smooth height transition) while the combination of classifications achieved from 8 directions certainly helps to improve final pixel classification accuracy.
The complete novel algorithm is described in Algorithm 1 resulting in a ground and a non-ground label image. Instead of processing each scanline sequentially, the image can be scanned in two passes, first from the upper-left to the lower-right corner and second from the lower-right to the upper-left corner, processing four scanlines at each pass (cf. Figure 16). Doing so, image data has to be read only once and instead of storing eight label images, one label image can be used to sum up all ground pixels, followed by thresholding as stated above. This specific processing reduces the memory consumption by a factor of 8 and results in a minor speed-up since the input data is only read once. ThrSlope // Slope threshold in degrees.

Extension
The following extensions could be easily performed, where currently only the first one is supported by our implementation. The second and third options are very generic and independent of the proposed MSD DTM generation method, such that they are not evaluated here.

•
Apply a peak filter after the filtering step to get rid of small, mostly incorrect, regions. This idea is similar to the filtering of disparity maps in image matching.

•
Since, taking the minimal value of a distribution is not robust to outliers, the nth percentile could be taken instead. Then the algorithm becomes robust toward outliers in the input DSM.

Ortho-Rectification
Standard direct ortho-rectification is performed for each image. In the presented workflow, the rectification is embedded in the DSM resampling step during DSM generation. There, the multi-spectral images are available as epipolar rectified products. In the step of spatial point intersection, those points get their 3D coordinates and their multi-spectral values. Thus, the DSM resampling step is also applied to generate the ortho-images. For mosaicing, the ortho-images are sorted w.r.t. to their global incidence angles such that the more nadir ones are prioritized. During fusion, a radiometric block adjustment is performed based on a least squares adjustment.

Results
This section reports on a multitude of results based upon the previously presented methodologies. In particular, sensor modeling and optimization, digital surface model generation and digital terrain model derivation are addressed. First, the 2D and 3D geo-location accuracy is assessed for all images and stereo sets. This investigation includes a study on different parameters that are adjusted in the sensor modeling step. Second, the quality of the resulting DSMs is assessed. Here an investigation of the epipolar geometry, qualitative and quantitative evaluation of DSM accuracy w.r.t. reference LiDAR data and stereo versus multi-view stereo experiments are given and discussed. Third, DTM extraction is performed and results are compared to LiDAR reference data.

Sensor Modeling and Parameter Optimization
As first task, the 2D geo-location accuracy of all sensor models is evaluated following the assessment protocol defined in Section 4.1. The resulting statistics are presented in Table 3. Statistics are given in across and along track direction such as for the absolute 2D length. Here, it can be observed that most of the images have mean circular errors (represented as mean initial length) below 17 pixels, which corresponds to the 8.5 m CE90 as reported in Reference [32]. The Trento triplet holds the largest geo-location errors even above the CE90 values.
As second task, all sensor models are adjusted by optimizing the constant and linear RPC terms and the resulting RMS values are also reported in Table 3. The across and along track residuals are all in the sub-pixel range (0.3 to 0.9 pixels) and show a widely homogeneous behavior. According to the nominal GSD of 0.5 m, these pixel values correspond to geo-location errors in the range of 0.15 m to 0.45 m on ground. As discussed in Section 3, the lower bound of those errors is the accuracy of the reference data. As the reference was manually measured for some test sites better results cannot be expected. Note, that the accuracies are rather high when considering the reference LiDAR data with 1 m GSD for some of the test sites. Since the discretization and manual measurement errors are assumed to be of Gaussian distribution, the least squares adjustment may find solutions which are better than the individual measurements.
As third task, a specific test is performed on the image 1 of the Innsbruck stereo test data (cf. Table 1 and Figure 3). The 2D geo-location accuracy of this selected sensor model is evaluated where the parameter optimization is once based on releasing only the constant term of the RPC nominators and once by releasing the constant and linear terms. To doing so, the given 30 GCPs are divided into different sets of GCPs and ICPs. The results of this analysis, including mean and STD of checkpoint residuals, are summarized in Table 4. The table shows that utilization of the absolute minimum number of GCPs (i.e., 1 for constant and 4 for constant and linear coefficient optimization) yields systematic errors for both scenarios, as expressed by the mean residual values. Appropriate over-determination, for example, utilization of 10 GCPs in this assessment, reduces such systematic errors significantly. The numbers also show that removing the shift only might be sufficient to achieve reliable accuracy, which is also confirmed in Reference [42]. Overall, we propose to perform the constant and linear-based adjustment since smaller residuals are retrieved and potentially systematic errors can be reduced.
As fourth task, another specific test is performed on all six images from the Ljubljana test site. Here, the sensor models are adjusted based on the optimization of constant and linear terms with the given GCPs. Then, those models are also evaluated on the ICPs to assess if an over-fitting occurs. Statistics are presented in Table 5. Here, the mean standard deviations after adjustment are 0.47 pixels for GCPs and 0.73 pixels for ICPs. As expected, the models adjust to the GCPs but also give decently small errors on ICPs.
Overall, all models show initial displacements and thus have to be corrected before continuing with the mapping workflow.  Next, the 3D geo-location accuracy using the initial and optimized sensor models is assessed, based on the protocol defined in Section 4.1. The RMS values of the 3D point residuals which were achieved for East, North, Height and their 3D length are summarized in Table 6. These values indicate the 3D mapping accuracy that is feasible when employing the stereo or triplet models for 3D reconstruction. First, it is obvious that the initial accuracy of the rational polynomial models yields 3D displacements in the order of several meters, which is clearly beyond the aspired precision. Next, the values given in the table show that the stereo intersection angle δ, as an equivalent for the base-to-height ratio, has a predominant impact onto the 3D geo-location accuracy. Higher accuracies are achieved for image pairs covering larger stereo intersection angles and vice versa. This aspect is depicted in Figure 17 which shows the RMS values of the 3D length residuals plotted versus the convergence angles. The red curves give the theoretical 3D errors that would result from a given 2D measurement error in meters.  Table 6) are shown in blue. It is clearly visible that the two stereo pairs with small intersection angle (i.e., Innsbruck triplet center 1-3 with 4.6 • and Trento triplet 1-2 with 5.5 • ) yield significant worse 3D geo-locations than all other stereo (and tri-stereo) constellations. For constellations involving larger stereo intersection angles the RMS values in planimetry are in a range of 0.2 m to 0.5 m. The RMS values in height are between 0.30 m and 1.5 m. The variations in height are diverse for the given test sites, which can be traced back to the accuracy of the reference data. Here, the best results are achieved for the Ljubljana test site, where highly accurate reference data is available such that the RMS in height also gets down to 0.3 m. In contrast, the Singapore test site holds rounded height reference values and thus this test site shows the highest RMS in height with 1.5 m. Overall, the sensor model adjustment results in optimized models that can be used to derive 3D information with high quality, that is, 0.3 m in planimetry and height, if the reference GCPs are also in this accuracy range.
Analog to the 2D accuracy assessment, the 3D mapping accuracy is foremost analyzed for two test sites based on GCPs and ICPs. Then, it is evaluated w.r.t. LiDAR reference data. First, the Innsbruck stereo set and second the Ljubljana multi-view stereo set are analyzed. For the Innsbruck test site, tests with respect to the utilization of a different number of GCPs and ICPs as well as with respect to the comparison of constant versus constant and linear nominator coefficients optimization are performed. The results of this analysis, including mean and standard deviation values of checkpoint residuals, are summarized in Table 7. Again, the utilization of a minimum number of GCPs yields systematic geo-location errors in East, North and height as manifested through the corresponding mean residual values. Over-determination as exemplary given, for example, by utilizing 10 GCPs, reduces these systematic errors to a more or less negligible order of magnitude and yields distinctly improved 3D RMS accuracy values, widely adequate for both optimization scenarios. The second test is based on the Ljubljana set, where the 3D geo-location accuracy is evaluated using GCPs and ICPs with the results depicted in Table 8. The initial RMS values reveal that the 2D geo-location residuals directly propagate to the resulting 3D geo-location accuracies. Thus, all stereo pairs containing the images 5 or 6 yield large displacements (also cf. Table 5). After adjustment, the majority of pairs yield high accuracies for the GCPs in planimetry of 0.2 m to 0.3 m and also in height of 0.2 m to 0.4 m. Actually, an accuracy at this level was never achieved before and is based on the highly accurate reference data. The statistics of ICPs are, as expected, a bit worse but no overfitting is observed. Figure 18 shows the 3D errors for GCPs and ICPs sorted w.r.t. the ICPs. Interestingly, there are four stereo pairs holding lower accuracy in height (and thus also in the 3D length), namely the pairs 1-3, 2-4, 3-5 and 4-6. When compared to the acquisition disposition depicted in Figure 6 all those pairs are pure across track images from two adjacent orbits with small convergence angles of about 12.5 • , which is the reason for the poor height estimates .

Digital Surface Model Generation
Before explaining the resulting DSM quality a very interesting point about epipolarity is discussed, which is neglected in other literature. As explained a high-quality epipolar rectification is a pre-requisite for 1D image matching, like the semi-global matching. However, some may think that a Pléiades stereo pair is consistent even without sensor model adjustment. Thus, it may be sufficient to extract a DSM before adjustment and later shift this DSM to some reference system (cf. Reference [5]). Therefore, a test is conducted to determine if the initial sensor models are accurate enough to get epipolar images. A number of TPs is measured in the original images and projected to the epipolar images. There, the distance orthogonal to the epipolar direction (i.e., the column direction) is gathered, which should be zero in the optimal case. Table 9 shows the statistics of the deviation of tie-points orthogonal to the related epipolar direction before and after sensor model adjustment for the Innsbruck East Triplet. While the STD values only reduce a bit, a significant change in the mean values is observed. Especially in the pair 2-3 the initial mean value is −0.66 pixel, thus more than a half-pixel off. It is known that 1D matching degrades drastically if the correct match is not located in the current 1D search space. Overall, this is a proof that sensor models should be adjusted before further 3D processing. In cases where no GCPs are available the sensor model adjustment should be performed with TPs (cf. Reference [5]). Each TP between two images yield four additional equations in the non-linear equation system (cf. Section 4.1) but also three additional unknowns (i.e., the 3D location of that point). TPs can be automatically derived for such stereo images by, for instance, using the scale invariant feature transform (SIFT) [127], speeded up robust features (SURF) [128] or similar techniques [129]. In this case, a block adjustment can be performed for multiple images and thus for multiple RPC sensor models simultaneously.  Next, for all test sites dense DSMs were extracted using the Pléiades data and the presented workflow. The following accuracy assessment is based on the protocol defined in Section 4.2.6. The evaluation is limited to the three prime test sites Innsbruck, Trento and Ljubljana. For image matching, the parameters as given in Table 10 are used for all experiments. The two sets describe the standard version and the more accurate version. The main differences are the cost functions and the search sizes of intermediate and final pyramid levels. Due to the search sizes, the accurate version is significantly slower and thus for some application the standard version is useful as well. In the given evaluation, the settings of the accurate version are used. The matching and initial DSM generation are performed with the original resolution of 0.5 m. The downsampling to 1.0 m GSD is then performed in the fusion process by using a kernel size of 3 × 3 pixel and a step size of 2 in East and North direction. This downsampling is done to allow a direct comparison to the LiDAR data which is also given with 1.0 m GSD. For visual comparison and analysis, subsets of those test sites are presented in this section.  Figure 19 shows a subset of the Trento test site, which covers a hospital and its surrounding area of 360 × 360 m 2 . A Pléiades ortho-image, the corresponding LiDAR DSM, the stereo-derived DSMs as well as the triplet-derived DSM are illustrated. The terrain heights of the DSMs are scaled between 240 m (black) and 300 m (white). Infrastructural changes due to ongoing construction activities since the LiDAR acquisition can be seen. For instance, there is a new helipad and a parking lot, which previously used to be a park. The stereo-derived DSMs, in general, are affected by occlusion areas, which use to increase with increasing stereo intersection angle and which almost vanish in the triplet-derived DSM. The DSM derived from the 1-2 Pléiades stereo pair looks different in comparison to the others and seems to be more similar w.r.t. the LiDAR model. This is due to the small stereo intersection angle, which implies higher image similarity and thus a higher performance of the stereo matching with reduced occlusion areas. However, due to this weak geometric disposition, the height accuracy of this DSM is worse than for the other DSMs.
Exemplary quantitative analysis was made through comparison of LiDAR to stereo and triplet-derived DSMs. Due to the large temporal gap between LiDAR and Pléiades data acquisition only selected areas were analyzed which are not affected by temporal change due to construction, vegetation growth or cloud cover. The results are summarized in Table 11. The STD of the height differences confirm the interdependence of stereo intersection angles and 3D mapping accuracy, with worst accuracy achieved for the 1-2 stereo pair, that was discussed above. The triplet-derived DSM shows similar accuracy like the other stereo-derived DSMs. Nonetheless, it shows the best consistency in comparison to the LiDAR DSM when considering its visual appearance. It shows a clearly reduced amount of occlusion areas and more reliable and improved structures of buildings.  Figure 20 shows a subset of the Innsbruck test site, which covers an urban region of 1950 × 1000 m 2 . It illustrates an ortho-image, a LiDAR DSM, a stereo DSM generated by the commercial software Geomatica 2013 by PCI Geomatics (denoted as 2D matching DSM) and the stereo-derived DSM as generated by our workflow (proposed DSM). The comparison to this old PCI version is just performed to emphasize the influence of the stereo matching algorithm as in this version a 2D matcher was implemented. Newer versions also use SGM and thus show similar results than ours. The DSM height values are scaled within 620 m to 680 m. When comparing LiDAR and Pléiades DSMs, temporal changes again occur, like a new residential area in the upper-left image area. It is obvious that the proposed workflow preserves 3D breaklines better than the workflow based on 2D matching. For instance, large buildings are decently reconstructed while they are missing in the 2D matching based DSM.  For a partly forested region at test site Innsbruck with 550 × 550 m 2 Figure 21 shows a Pléiades ortho-image as well as the height differences between the LiDAR and the Pléiades DSM, scaled from −25 m to +25 m. Thus, bright areas indicate forest clear cuts, while dark areas correspond to forests growth between the LiDAR and the Pléiades acquisition dates. Analog to the Trento test site, a quantitative accuracy assessment was made for a selected area, which could be supposed to be free of temporal changes. The results are included in Table 11 and show slightly better accuracy than the Trento data sets. Overall, one very important outcome is the fact that the final DSM accuracy is worse w.r.t. the maximal possible values gathered through the sensor model optimization (cf. Table 6). Obviously, this decrease comes from the image matching step that on the one hand cannot yield the same accuracy as a manual point measurement. On the other hand in homogenous and repetitive image regions errors will occur or incorrect matches will just be interpolated.
The last test set Ljubljana is the most interesting one, due to its multiple view geometry. The three along track stereo pairs can be used to form 15 along and across track stereo pairs as listed in Table 2. For all those 15 pairs DSMs are extracted and in addition for some selected combination of pairs. To be able to compare the coverages of different DSMs the threshold in fusion for populating an output DSM pixel is set to 1/3 of the input pixels. For example, if we have four DSMs and the fusion uses a 3 × 3 pixel kernel, then at least 12 pixels out of the 36 have to be valid such that an output is generated. This rather high threshold is chosen to allow a fair comparison of resulting nodata pixels. Note, that all remaining gaps could be filled by means of interpolation within matching or fusion. However, filling is purposely omitted to be able to recognize if multi-view images have an impact on the reconstruction completeness. Table 12 lists a multitude of statistical values, in particular minimal, maximal, mean, STD, MED, nmAD, mean absolute error (MAE), percentage of nodata values (i.e., 100% − completeness) and the convergence angles. Figure 22 complements the statistics and depicts the distributions of differences from LiDAR and Pléiades DSMs (we took the whole area of the LiDAR acquisition with 345 km 2 for the comparison as in References [130,131]). It can be observed that there are huge outliers both positive and negative. They actually stem from the LiDAR reference data as there some height values in rivers are completely incorrect. As discussed in Section 4.2.6 those outliers lead to non-robust mean and STD estimates. The MAE measure also suffers from outliers and cannot contribute to the interpretation. Next, the plots reveal that four distributions are significantly worse (i.e., larger STD and nmAD) than all others. Those pairs are 1-3, 2-4, 3-5 and 4-6. Those pairs are exactly the four pure across track stereo pairs (cf. Figure 6) with small convergence angles. Due to the outliers, those pairs cannot be detected reliably with the STD estimate but with the nmAD. Two of these distributions show a bias in their mean value, that is, pairs 1-3 and 3-5. Again this behavior is not mapped to the mean values but to the MED values. This just emphasizes that non-robust fitting of a normal distribution based on mean and STD is not the best idea for our DSM difference models as they are non-Gaussian distributed. As discussed above, it can be observed that small intersection angles lead to large DSM completeness (i.e., low nodata percentage) where the 1-3 pair performs best. However, despite the completeness, this pair is rather inaccurate with one of the largest nmAD values. This behavior can be traced back to the small intersection angle, such that the stereo images are more similar and thus, image matching results in a more complete disparity map. However, due to the small intersection angle, the spatial forward intersection yields a larger 3D error. Same holds, for instance, for the 2-4 pair.
We also plotted STD, nmAD and nodata percentages versus the convergence angles as depicted in Figure 23. On the left side, the plots are given using the complete error distribution including all gross outliers. On the right side, only errors with absolute values smaller than 6 m are used such that gross outliers are removed. At first, it can be seen that the STD trends change completely in these two experiments. While using all data the STD is directly correlated to the convergence angle, it is indirectly correlated when removing the gross outliers. In addition, the value range of the STD changes drastically. This behavior again shows that the STD estimate is indeed not robust by any means and should be avoided when interpreting DSM differences. The NMAD plots on the other side yield the same trend and comparable number ranges for both experiments. As expected the nmAD decreases (and thus the accuracy of the DSMs increases) with increasing convergence angles. The percentage of nodata values is directly correlated with the convergence angle and, obviously, the number of nodata values increases in the second experiments as outliers are removed.
Next, combinations of stereo pairs are processed. Groups with small intersection angles in the range of 10 • to 20 • (i.e., pairs 1-3, 2-4, 3-5 and 4-6), medium angles in 20 • to 30 • (i.e., pairs 1-2, 1-4, 1-5, 2-3, 2-6, 3-4, 3-6 and 5-6) and large angles in 30 • to 40 • (i.e., pairs 1-6, 2-5 and 4-5). Here, the medium angles perform best, seen on the low nmAD and low nodata values. Only smaller or only larger angles degrade the resulting DSM. When using all 15 pairs, obviously, the nodata regions are lowest with only 1.1% (i.e, a completeness of 98.9%). Note, that the reference LiDAR data holds 0.44% of nodata values, such that the real completeness is even higher than reported. Also the nmAD of all pairs is decently small, however it is even lower for, for example, the pairs in medium angle range. Here, we conclude that stereo pairs should be selected within a certain intersection angle range, where 20 • to 30 • seem optimal (which correspond to a B/H ratio in the range from 0. 35 Figure 24 visualizes the differences in various DSMs. Next, to the reference LiDAR DSM the Pléiades DSM is shown based on the fusion of all stereo pairs with convergence angles in from 20 • to 30 • . For comparison also the DSM from the good stereo constellation 1-2 is depicted and from the worse performing constellation 1-3. At first, we see that the LiDAR model holds information like the power lines that cannot be reconstructed from Pléiades images (but nevertheless contributes to our error metric). From the Pléiades DSMs the fused one is visually a lot better, since it contains less gross errors (cf. the roof of the hall in the lower part of the image), it is more complete and it is also smoother. Comparing only the single stereo results the pair 1-2 yields a smoother surface than 1-3. Additionally, the roof structure of the central large building is not reconstructed from the 1-3 pair. Overall, this visual comparison supports our assumption that using multi-view Pléiades sets allow higher accurate 3D modeling than single stereo sets. To allow a visual interpretation of the high quality of our results, a subset of the Tian Shui test site is depicted in Figure 25 with 1 m GSD. There, the CIR ortho image is shown together with the relief shaded DSM covering 2001 × 1701 m 2 with heights from 1380 m to 1780 m. The agricultural use of the landscape is clearly visible due to the nicely reconstructed rice terraces.

Digital Terrain Model Generation
Tests are performed on the Innsbruck East triplet data set where the DSM is derived with 1 m GSD. For DSMs with lower or higher resolution, the processing parameters very likely have to be appropriately tuned to achieve optimal filtering results. For all tests the values given in Table 13 have been used.

Qualitative Evaluation
For visual comparison, the results of several ground pixel filtering methods are given in Figure 26 for an area of 721 × 361 m 2 , showing a residential area at a hillside. The figure shows (a) the pansharpened true-ortho image (c) along with the input DSM. Further, it shows (b,d,e) the ground masks as resulting from DSM filtering by (b) using only two directions (i.e., left and right, which corresponds to the method in Reference [93]), (d) using 8 directions and (e) using 8 directions as well as slope dependent processing. It is clearly visible that due to the hillside location the first two methods filter too many points due to the negligence of the local terrain slope, whereas our advanced MSD method clearly yields a highly plausible and reliable segmentation. Forest areas and houses are removed while many ground points are correctly detected. Figure 27 shows a comparison of (a) 3D views of the input DSM and (b) the resulting DTM, which makes the removal of trees and houses very well visible. Figure 28 additionally shows detailed views of an urban and a suburban region.

Quantitative Evaluation
First, the Pléiades tri-stereo based DSM is compared to the LiDAR DSM (cf. Section 4.2.6). Due to the large temporal gap between LiDAR and Pléiades data acquisitions only selected areas are analyzed, which are not affected by temporal changes due to construction, vegetation growth or cloud cover. Mean values as well as STDs of the height differences are summarized in Table 14 for an area of 25.4 km 2 . Here, the Pléiades based DSM has a bias of 0.64 m and thus, is actually too high. However, this is within the height uncertainty given for this sensor [19,26,132]. Consequently, the differences between LiDAR and Pléiades based DTMs will have the same bias, which however is not a result of the presented DTM generation algorithm.
Second, the difference of the reference LiDAR DTM to the Pléiades DTM, which has been extracted using the proposed algorithm, is analyzed. The results are given in Table 14. The mean bias between both DTMs indeed is very similar to the one achieved for the DSMs. Because not all non-ground points are perfectly removed during the Pléiades DTM generation, the achieved DTM is locally above the LiDAR DTM, resulting in an additional mean height difference of 0.11 m. However, this bias is really small and it is below the absolute accuracy of LiDAR height measurement as well as Pléiades stereo height measurement.   Figure 29 shows DSMs and DTMs as generated from LiDAR and Pléiades data for two small areas of 100 × 100 m 2 each, as well as profiles for selected objects, representing (a) a skyscraper and (b) a round building, respectively. Analysis of the plotted profiles indicates that small structures like the buildup on the first skyscraper or the hole in the center of the round building, are not reconstructed using Pléiades data. The DTM generation is able to remove buildings and in some places our DTM looks even better than the LiDAR DTM, for instance, on the left border of the round building where the LiDAR DTM is even below ground level.

Discussion
Researchers and operators require robust and automatic workflows that yield highly accurate 3D mapping products from VHR multi-view stereo satellite data. Thus, this manuscript proposed an end-to-end workflow that yields the desired 3D mapping on the example of the Pléiades satellite constellation. The contributions w.r.t. the methodological development are as follows: 1.
Introduction of the complete end-to-end 3D mapping workflow.

2.
RPC-based sensor model adjustment in object-space, employing Newton's method to solve the non-linear equation systems.

3.
Epipolar rectification, where the method in Reference [54] was modified and the first time applied to optical satellite images.

4.
Interpolative approach in the non-linear epipolar rectification for speed-up.

5.
Combined Census transform with SAD matching cost for increasing the accuracy of image matching. 6.
Explicit modeling of the function which is used to vary the penalty P 2 in SGM. 7.
Extension of classical SGM to a truncated SGM by limiting the local disparity range, resulting in significant speed-up, while yielding comparable accuracy. 8.
Memory optimization in SGM by discretizing the disparity space image to 16 bits. 9.
Explicit modeling of nodata values in the epipolar images during matching for speed-up. 10. Disparity prediction for increasing the robustness and efficiency in top-level image matching. 11. Subpixel normalization of the resulting disparity maps. 12. Extension of the local DSM fusion method to allow downsampling within the fusion step. 13. Novel multi-directional slope dependent DTM generation method, which extends the approach in Reference [93].
Additionally, all steps of the workflow were evaluated by in-depth accuracy assessments based on a multitude of satellite and the related metadata. The assessment itself represents a main contribution of this work. Since the presented methods were implemented within a commercially available software interested audience can directly build upon our findings. Note that the presented workflow was already successfully applied by other research groups, for instance, in References [4,110,[134][135][136]. Especially, the DTM generation was applied in References [102,[137][138][139][140][141][142][143].

Main Findings
From the methodological point of view, we stated that the presented RPC sensor model adjustment in object-space is simple and works well. In contrast to image-space based methods, no additional information like the coefficients of the affine transformation has to be stored. Furthermore, height dependency is an intrinsic part of the adjustment. Epipolar rectification is solved, the presented solution is very general and can also be used for SAR geometries. Several extensions in image matching were presented, mainly for speedup reasons. Details on the adaptive penalty paradigm were presented together with an extension to use truncated disparity space images (again for speedup). For better robustness, the concept of top-level predictions was introduced, which is based on a coarse input DSM, if existing. Subpixel refinement was discussed and a simple solution was described that yields uniform distributed subpixel fraction distribution. Storage concepts for 3D point clouds and their resampling to a regular grid were discussed as well. For DSM fusion a simple method was enhanced to also allow downsampling in the same step. Statistical measures, in particular, the MED and nmAD, were discussed and insights were given why the mean and the STD should not be used to compare DSMs. Regarding DTM generation a novel method was proposed and described in detail. In comparison to its initial publication, it was also extended. The method itself sticks out with its simplicity while yielding good results.
Also, the evaluation of all 24 Pléiades images revealed exciting insights. As expected the initial 2D and thus also 3D geo-location accuracy is inadequate for mapping. It also has been shown that even one along track stereo pair may suffer from relative geometric distortions, which was demonstrated as deviation orthogonal to the epipolar direction. After adjustment, the residual errors dropped significantly. Here, we observed that the lower limit comes mainly from the quality of the reference GCPs and the image measurements. For the Ljubljana multi-view set, where highly precise reference information was available, a 2D and 3D accuracy was achieved as never achieved before (3D accuracies in the range of 0.2 m to 0.3 m in planimetry and 0.2 m to 0.4 m in height w.r.t. GCPs). This can be seen as empirical proof that the proposed sensor model adjustment and spatial forward intersection procedure perform as desired. During the analysis, we detected an accuracy drop within the across track stereo pairs of the Ljubljana test site, which can be traced back to the small convergence angles of those stereo pairs. Since such Pléiades multi-view constellations were never before processed they were also not reported in literature. One specific outcome of the study on the multiple view data set was that convergence angles in the range of 20 • to 30 • , which correspond to a B/H ratio of 0.35 to 0.55 yield best results. It seems that this convergence angle range is the optimal tradeoff in quality between image matching and spatial point intersection. Thus, in cases where many images exist, only stereo pairs with appropriate convergence angle should be used. The comparisons of LiDAR to Pléiades based DSMs showed highly accurate reconstructions, with an nmAD around 0.9 m. It also revealed that robust estimates for characterizing a Gaussian distribution have to be used in remote sensing to avoid incorrect biases when comparing DSMs. Using multiple stereo pairs for DSM generation increased the completeness and the accuracy. Terrain model extraction worked well and introduced no additional errors on the height values. Problems occur in large forests where no ground height value is visible. This aspect may degrade DTMs especially in forested hilly terrains.
Overall, the presented end-to-end workflow delivers mapping products of very high quality.

Future Research Goals
The most difficult blocks within the processing chain were identified as the stereo matching, the fusion process and DTM generation. The first two methods directly contribute to the quality of the resulting DSM. In stereo matching, we should try to increase the plausibility of the cost functions. A realistic way will be to learn the cost function on huge ground truth data sets based on deep learning of a convolution neural network. A second option which seems to be promising is to deeply investigate the generalized Census transform and combine it with a robust cost function that also yields useful correspondences in texture-less regions.
The fusion process should somehow globally optimize the resulting surface, while being computationally feasible. Like in SGM an RMF could be solved in a heuristic manner. Alternatively, global optimizations like in References [86,87] would be of interest as well even though they are computationally complex. Anyhow, like in photogrammetry with increasing resolutions the community will go from DSM as the desired output to a fully 3D point cloud. Thus, the fusion of multiple stereo disparities has then to be performed in 3D yielding a simplified triangulated textured point cloud representation.
As in the fusion process, DTM generation could be solved globally, for example, by a variational approach. In addition, the multi-spectral information, that is, our ortho mosaic may serve as an input in processing. A classification on the multi-spectral data would help to discriminate man-made objects and vegetation from bare earth regions.
To reduce the manual effort in the whole process automatic methods for transferring GCPs from SAR amplitude images to optical images are needed. Actually, the underlying data exists (e.g., worldwide TerraSAR-X acquisition that were used in the WorldDEM generation) and also methodologies for multi-modal image matching. In the optimal case, such a transfer workflow could be directly performed at the satellite data providers. Then, end-users would get optical images and GCPs or optical images with already optimized sensor models.

Conclusions
An end-to-end workflow for mapping with very high-resolution satellite data form the basis for any further semantic analysis. In specific, many applications in remote sensing rely on the following 3D mapping products: (1) DSM, (2) DTM, (3) nDSM and (4) ortho-rectified image mosaic. For this reason, this work described all underlying principles for satellite-based 3D mapping and proposed methods that extract all those products from multi-view stereo satellite imagery. The study was based on, but not limited to, the Pléiades satellite constellation. Besides an in-depth review of related work, the methodological part proposed solutions for each component of the end-to-end workflow. In particular, this included optimization of sensor models represented by rational polynomials, epipolar rectification, image matching, spatial point intersection, data fusion, DTM generation and ortho mosaicing. For each step implementation details were proposed and discussed. Another objective of the study was a detailed assessment of the resulting output products. Hence, a variety of test sites were chosen and data sets were gathered representing different acquisition scenarios. The assessment was based on 24 Pléiades images. First, the accuracies of the 2D and 3D geo-location were analyzed. Second, surface and terrain models were evaluated, including a critical look on the underlying error metrics. The differences between single stereo, tri-stereo and multi-view data sets were analyzed as well. Overall, 3D accuracies in the range of 0.2 m to 0.3 m in planimetry and 0.2 m to 0.4 m in height were achieved w.r.t. ground control points. Retrieved DSMs showed normalized median absolute deviations around 0.9 m in comparison to reference LiDAR data. Multi-view stereo outperformed single stereo in terms of accuracy and completeness of the resulting DSMs.
Author Contributions: R.P. had the original idea for the study. H.R. and P.M.R. supervised the research and contributed to the article's organization. R.P. drafted the manuscript, which was revised and approved by all authors.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The