remote sensing Building Height Extraction from GF-7 Satellite Images Based on Roof Contour Constrained Stereo Matching

: Building height is one of the basic geographic information for planning and analysis in urban construction. It is still very challenging to estimate the accurate height of complex buildings from satellite images, especially for buildings with podium. This paper proposes a solution for building height estimation from GF-7 satellite images by using a roof contour constrained stereo matching algorithm and DSM (Digital Surface Model) based bottom elevation estimation. First, an object-oriented roof matching algorithm is proposed based on building contour to extract accurate building roof elevation from GF-7 stereo image, and DSM generated from the GF-7 stereo images is then used to obtain building bottom elevation. Second, roof contour constrained stereo matching is conducted between backward and forward image blocks, in which the difference of standard deviation maps is used for the similarity measure. To deal with the multi-height problem of podium buildings, the gray difference image is adopted to segment podium buildings, and re-matching is conducted to ﬁnd out their actual heights. Third, the building height is obtained through the elevation difference between the building top and bottom, in which the evaluation of the building bottom is calculated according to the elevation histogram statistics of the building buffer in DSM. Finally, two GF-7 stereo satellite images, collected in Yingde, Guangzhou, and Xi’an, Shanxi, are used for performance evaluation. Besides, the aerial LiDAR point cloud is used for absolute accuracy evaluation. The results demonstrate that compared with other methods, our solution obviously improves the accuracy of height estimation of high-rise buildings. The MAE (Mean Absolute Error) of the estimated building heights in Yingde is 2.31 m, and the MAE of the estimated elevation of building top and bottom is approximately 1.57 m and 1.91 m, respectively. Then the RMSE (Root Mean Square Error) of building top and bottom is 2.01 m and 2.57 m. As for the Xi’an dataset with 7 buildings with podium out of 40 buildings, the MAE of the estimated building height is 1.69 m and the RMSE is 2.34 m. The proposed method can be an effective solution for building height extraction from GF-7 satellite images.


Introduction
With the acceleration of China's urbanization process, the concept of "urban fine management" has been proposed [1]. Urban fine management is designed to solve the problems of blind expansion [2,3], population assessment [4][5][6], urban climate [7,8], destruction of natural environment and cultural heritage [9,10], urban 3D reconstruction [11][12][13][14], etc. As the fundamental element of urban cities, buildings are one of the most important research targets, and their height information plays a critical role in urban exemplary management. Thus, effective solutions are necessarily required in practice. contour constrained stereo matching and podium segmentation, in which the backward image block is matched against the forward image with the similarity metre of feature map difference. Since the podium contains multiple elevation planes, the podium is segmented with image gray difference and rematched to obtain the elevation of multiple planes. (3) The third step is bottom elevation estimation and building height calculation. The bottom elevation is calculated according to the elevation histogram statistics of building buffer by using DSM. The top elevation is calculated using a space intersection algorithm. The building height is then obtained through the difference between the top and bottom elevations.
This paper is organized as follows. Our methodology is presented in Section 2. The experiment data and results are reported in Section 3. Finally, conclusions are drawn in Section 4.

Methodology
This paper proposes a roof contour constrained stereo matching method to extract the accurate height of high-rise buildings from GF-7 satellite images. The overall workflow of the proposed solution is illustrated in Figure 1. The input of our method consists of GF-7 backward and forward images, building roof contours on the backward image, and the corresponding DSM data. The method includes three major steps, as described as follows: (1) Epipolar image block and feature map generation from stereo images. The first step is to prepare the epipolar image block for roof contour matching, in which the parallax search range is determined by using the DSM data and a supposed maximum range of building heights. After image resampling, the epipolar image blocks are transformed into feature map blocks by computing regional standard deviation to extract the structural characteristics. (2) Roof contour constrained stereo matching considering podium structure. The backward image block is first matched against the forward image block with the similarity measure of image feature difference. To deal with the multi-heights problem arising from podium buildings, the gray difference image is used to detect the mismatched contours, which can include podium. The image is then rematched iteratively until no podium structure can be detected. With the matched parallax, the roof elevation is calculated through the space intersection using the RPCs of satellite stereo images and the 2D roof contours on the backward image is transformed into 3D roof contours in the geographic coordinate system. (3) Bottom elevation estimation and building height calculation. Bottom elevation is estimated using histogram analysis of the DSM in the buffer of a building contour with a buffer size of 20 m. Building height is then calculated by the subtraction between bottom elevation and roof elevation.

Epipolar Image Block and Feature Map Generation for Stereo Images
This step aims to prepare epipolar image blocks for roof contour constrained stereo matching. The source block range on the backward image is determined by the building contour itself. The destination block range on the forward image depends on the building roof elevation, which needs to be determined by image matching. So we use DSM for estimating the bottom elevation and plus the maximum possible height for roof elevation. Then the roof contour is transformed from the backward image to the forward image for epipolar resampling. At last, convolution processing is used to extract the standard deviation feature of the epipolar image blocks for decreasing the impact of bright change between the forward image and the backward image.

Calculation of Parallex Search Range
The principle of parallex search range is illustrated in Figure 2. The input building roof contour is on the 2D backward image, and the DSM is in the 3D object space. An iteration algorithm is used to obtain building roof elevation from the DSM data, in which it starts from an initial elevation Z 0 , such as the mean elevation of the study area. Then, the algorithm executes iteratively as follows: (1) With the initial evaluation Z 0 , we can get point P 0 in the 3D object space by intersecting imaging ray through the roof center with the horizontal plane created by the elevation of Z 0 . (2) After obtaining point P 0 , we use the horizontal coordinates X and Y of point P 0 to interpolate a new elevation value of Z 1 from the DSM data. (3) If the elevation difference between Z 1 and Z 0 is greater than a given threshold, then update Z 0 with Z 1 . (4) Steps 1 to 3 are iteratively executed until reaching the terminal condition. By using the above-mentioned algorithm, the minimum elevation can be obtained, which is termed as Z min . Considering the maximum building height h max in the test area, the elevation search range can be calculated as Z min + h max . With the elevation search range (Z min , Z max ), the parallex search range (begin, end) on forward image can be calculated by projecting the points of roof contours on the backward image to the forward image, whose envelope consists of the entire parallex search range on the forward image.

Epipolar Image Block and Feature Map Generation
By using the determined parallex search range, the GF-7 image is resampled by the epipolar transform to generate epipolar image blocks. After resampling the forward and backward images, their y parallax is eliminated. The two-dimensional image matching is simplified as one-dimensional matching in order to improve the matching efficiency [23,41].
The brightness change between the forward image and the backward image can greatly affect the image matching. To decrease the impact of brightness change, we conduct the image matching on the standard deviation maps instead of the original epipolar image block themselves. As shown in Figure 3, we use a convolution of 5 × 5 window to generate standard deviation maps. The larger the standard deviation value, the richer the image pixel information. The standard deviation is calculated with Equation (1) where S (i,j) is the standard deviation value at point (i, j); n is the radius of the window;x is the mean value of the window; x (i+a,j+b) is the pixel in the window.    Figure 5 illustrates the epipolar image blocks and their corresponding standard deviation maps. Figure 5a is the backward image block of a building; Figure 5b is the forward image block to be matched. Due to the imaging view difference, these two image blocks have dramatic bright difference. Figure 5c,d respectively shows the standard deviation map of Figure 5a,b, whose bright difference is greatly reduced. Based on the standard deviation maps, we use the gray difference to measure the similarity between the backward image and the forward image. The gray difference similarity is measured as the absolute gray difference between the template and the searched image, as shown in Equation (2), where g S i+pj is the searched pixel value with x parallax p; w, h is the template window size; g T i+j is the pixel value of the template window, respectively. The parallax p with the smallest gray difference sum is regarded as the matched result.
In order to exclude the interruption of pixels outside the building roof contours, the gray difference is modified to be a masked gray difference, as shown in Equation (3), in which only the roof pixels are taken into account.
where M is the mask matrix of template block; (i,j) is the pixel in the mask set.
As illustrated in Figure 6, a simple building with only one roof height plane can be matched in one pass, in which the mask is generated with the roof contour indicating the inner part of a building roof, and the histogram of the gray difference image has only one obvious peak.

Segmentation and Rematching of Building with Podium
In the previous section, we suppose that the building roof has only one height. That is correct for most buildings. In the real world, buildings, however, have different shapes and structures, and many buildings have several height planes. Such as, the podium building refers to an auxiliary building that is general at the bottom of the main body of a multi-story, high-rise, or super high-rise building. Its floor area is greater than the standard floor area of the main body. As shown in Figure 7, the pink rectangle and blue rectangle belong to one building. However, they have two different height. For simplicity, we call the building part below the main body as podium building. Apparently, we can not estimate the two heights in one pass matching processing. Fortunately, the gray difference map of the matched blocks can provide more clues to refine the initial matching result. The gray difference map data represents the average brightness difference of backward and forward parallax. As shown in Figure 6, the gray difference histogram distribution should be in the range of (0-100) values if the matching result is completely correct. However, there are two peaks in the histogram as shown in Figure 8a, one is in the range of (0-100), and the other is in the range of (200-250). This indicates that there are two different heights in the building, and only one height is matched correctly. To solve this problem, we match the different height of a building with podium using the masked gray difference in an iterative manner. As shown in Figure 8a,b, by using two different masks, the bottom height of the podium is estimated in the first pass, and the top height of the podium is estimated in the second pass.  Figure 9 shows the principle of mask refinement and contour segmentation of podium building. First, the gray difference image (Figure 9a) is binarized to generate the new mask ( Figure 9b). Second, an a connectivity analysis is conducted to erase small regions whose pixel numbers is smaller than a given threshold. The region with maximum number of pixels is selected as the area to be matched ( Figure 9c). Third, the morphological filter is carried out on the selected area (Figure 9c) to get a refined mask (Figure 9d). Finally, one or several sub roof contours (Figure 9e) can be generated using the rectangles covering the sub regions of the refined mask, in which the covering rectangles are parallel to the original roof contour. Based on above idea, the process of matching and segmentation of a building with podium is outlined as follows, as illustrated in Figure 10: (1) Set the mask as the whole roof using the blue contour polygon.
(2) Search the x-parallax with the minimum masked gray difference in the standard deviation map; (3) Compute the gray difference map between matched image blocks; (4) Compute the histogram of the gray difference map in the mask set; (5) If there are more than one peaks in the histogram, the gray difference image is binarized and refined as a new mask for next matching. (6) Repeat steps 2 to 5 until 90% pixel is matched.

Bottom Elevation Estimation and Building Height Calculation
Three steps are used to estimate the building bottom evaluation and building height. First, combined with the matched x-parallax, the ground point of the building contour center can be computed using the intersection of the two homogeneous imaging rays of the stereo image pair [42]. Then using the RPCs of the backward image, the contour of the building roof on the backward image can be transformed into the object space as the ground footprint, which is on the horizontal plane defined by the elevation of the building center. Second, as shown in Figure 11, the bottom elevation is estimated by using elevation histogram analysis of DSM values, which is created in a 20 m buffer around the building footprint. Then the elevation at the minimum peak of the histogram is chosen as the bottom elevation. Third, the building height is obtained by subtracting the bottom elevation from the roof elevation.

Study Area
There are two datasets used in this study for performance evaluation, as shown in Figure 12. Dataset one locates in Yingde City, Guangdong Province, between 113.31-113.50°E, 24.25-24.41°N. This study area contains more than 8000 buildings with their footprints in shapefile format, whose height ranges from 10 m to 100 m. By using aerial LiDAR data, we estimated the roof elevation, bottom elevation, and building height as ground truth. With the GF-7 images, we used ENVI and INPHO to generate DSM, which can be used to estimate building height directly or as the input of our roof contour constrained matching method. Then, our result and DSM based result were compared with the LiDAR based ground truth. In addition, the shadow based result was also compared. Dataset two locates in Xi'an high tech Industrial Development Zone, Shanxi Province, between 108.85-108.89°E, 34.18-34.24°N. In this area, we focus on high-rise buildings and podium buildings. In this test site, there are no building footprints and LiDAR data to generate ground truth. Instead, we took the building heights measured from GF-7 stereo images as reference values. We selected 40 buildings, which include 33 single buildings and 7 podium buildings. The height distribution of this dataset ranges from 20 m to 350 m, and the average building height is about 95.55 m, among which the Xi'an Guorui financial center with a height of 348.384 m is the highest building.
3.1.2. GF-7 Stereo Images GF-7 satellite stereo images are used to extract building height. The parameters of used GF-7 satellite stereo images are presented in Table 1. GF-7 satellite is equipped with payloads, including a dual-line array camera and a laser altimeter. The three-line array camera can effectively obtain panchromatic stereo images with a width of 20 km and a resolution better than 0.8 m. In addition, infrared multispectral images with a resolution of 2.6 m can also be obtained. The infrared multi-spectral camera is aligned with the backward panchromatic camera for pansharpening. The backward direction is nearly nadir, designed for balance between less occlusion and bigger stereo intersection angle. Thus, the backward image is better for the extraction of building roof contours.

LiDAR Data
The LiDAR data used in this paper was acquired with Leica CityMapper-2 [43], which is an airborne hybrid sensor of oblique camera and LiDAR. The flight altitude is about 1450 m and the point cloud density in Yingde is 8.1 points/m 2 . As an example, a small area of LiDAR point cloud in Yingde is shown in Figure 13.

Building Roof Contours and Ground Truth
Our method supposes that building roof contours on the backward image are available for matching against the forward image. For the experiment purpose, building roof contours are generated from footprint data or digitized directly on the backward image. In the practical processing of city scale data, the building contour can be generated by semantic segmentation of backward image.
In the Yingde study area, roof contours are based on the building footprint, which were provided by Hubei Sunrise Photogrammetric Company. Based on the histogram analysis of the point cloud in the buffer of a building footprint, the roof elevation and bottom elevation of a building can be estimated, then the building footprint with roof elevation can be projection onto the backward image as the input roof contour for our workflow. Partial roof contours are shown in Figure 14a.
In the Xi'an study area, we directly digitized 40 buildings' roof contours on the backward image with the ArcMap software, as shown in Figure 14b.
The ground truth of building height was calculated according to the roof elevation and bottom elevation. As mentioned ahead, the roof elevation and bottom elevation in the Yingde study area were estimated with LiDAR data. In the Xi'an study area, the roof elevation and bottom elevation were measured in a self-developed stereo model viewer software.

DSM Generation
The input DSM in the workflow(shown in Figure 1) is generated by a self-developed dense matching software based on the pyramid semi-global algorithm, which is silimar to SURE [44].
The DSM used in the Section 3.3.2 is generated by commercial software ENVI and INPHO, which are industry leading software.
As shown in Figure 15, five control points were selected from the LiDAR point cloud and the positioning accurarcy of GF-7 stereo image pair was imporved with RPC modification [45] before DSM generation.
For the ENVI, the parameters are set as follows: the minimum overlap is 55; the matching threshold is 15; the edge threshold is 5; the quality threshold is 60; the terrain type is Flat.
For INPHO generating DSM, the parameters are set as follows: the terrain type is Flat; the feature density is Dense; the point cloud density is 3 pixels; the Parallax threshold is 20 pixels. Three block of DSM results generated from ENVI and INPHO are shown in Figure 16.

Evaluation Metrics
Height accuracy is evaluated by comparing the estimated building height and the reference building height. Three metrics are used for performance evaluation, including the mean absolute error (MAE), root mean square error (RMSE), and max absolute error (maxAE), which are formulated as Equations (4)-(6): where, h i is the estimated building,ĥ i is the reference building height.

Experimental Results of Yingde Dataset
In Yingde study area, there are a total number of 8653 buildings, whose height is less than 100 m. For accuracy evaluation, the ground truth of building height is generated from aerial LiDAR data. Table 2 Table 3 shows the number and accuracy statistic of building height at varying elevation ranges. The results show that the overall building roof elevation estimation is relatively stable, and the bottom elevation estimation accuracy fluctuates wildly since the DSM is easily affected by the occlusion of trees and buildings.

Comparison with the Shadow-Based Method
The proposed solution is compared with the shadow-based method in this section. Two sub datasets that are extracted from Yingde dateset, are used for evaluation. The first sub dataset is a low-rise building dataset with the height ranging between 0-30 m, and there are 299 buildings in the low-rise building dataset; the second sub dataset is a high-rise building dataset with the height larger than 30 m, and there are 170 buildings in the high-rise building dataset. According to Xie [20], when the ground height is horizontal, the shadow length of a building is directly proportional to the height of the building. Based on this observation, we measure the shadow length of buildings manually to avoid the deficiencies of automatic extraction methods. Because the inclined angles of forward camera and backward camera are different in GF-7 satellite, we estimate the building height on both the forward image and the backward image for extensive comparison.
The statistic results of shadow-based method and the proposed solution are shown in Tables 4 and 5. Table 4 shows the comparison of accuracy statistics of building height. According to the experimental results, we can conclude that: (1) the MAE of the shadow estimation method on the high-rise building dataset is less than that of the low-rise building; (2) In low-rise building dataset, due to the occluded image in the shadow area, 10% buildings in the forward and 36% buildings in the backward participate in the metric calculation. The average accuracy of the proposed solution in terms of MAE, RMSE and maxAE is significantly higher than that of the shadow-based method; (3) In highrise building dataset, 45% the buildings in forward and 59% the buildings in backward participate in the calculation. Similarly, the accuracy of the proposed solution is significantly higher than that of the shadow-based method for the three metrics.
In addition, Table 5 shows the comparison of number statistic of building height at varying elevation ranges, in which the term occlusion indicates the number of buildings that occluded. we can see that for the low-rise building dataset, 268 (89%) buildings have shadow occlusions on the forward image, and 190 (64%) buildings have shadow occlusions on the backward image; for the high-rise building dataset, 93 (55%) buildings have shadow occlusions on the forward image, and 69 (40%) buildings have shadow occlusions on the backward image. The height of all the occluded buildings cannot be estimated. On the contrary, the proposed method can successfully estimate the height of all buildings. For a visual interpretation, Figure 17 illustrates some shadow occlusion cases: 1 Sheltered by trees; 2 blocked by nearby buildings. As shown in Figure 17a,c, the shadows at the edge of the house are blocked by trees, which belongs to case 1; On the backward image in Figure 17b, the shadow is not obscured due to the small inclination of the camera; however, in Figure 17d, the shadow on the forward image is occluded by the buildings in the backward and forward rows. It can be seen that the shadow-based method has obvious limitations when applied to GF-7 images. In a word, a majority of buildings cannot be extracted by using the shadow-based method due to the serious occlusion that occur in GF-7 satellite images, especially for the forward image.

Comparison with DSM Based Method
In this section we compare the proposed solution with DSM Based Method. We used ENVI (version 5.6) and INPHO (version 12.1) to generate DSM in Yingde. After generating the DSM, we overlaped the building contour vector onto the DSM for building height extraction. The roof elevation and the bottom elevation are estimated with histogram analysis, which has been discussed in the Section 2.3. The peaks of maximum elevation and minimum elevation are selected as of the building roof and the bottom. Shown in Figure 16, DSM generated by INPHO is relatively smooth, however, it loses some high-rise heights. On the contrary, the DSM generated by ENVI can retain high-rise buildings better, however, it can contain some holes. Table 6 lists the accuracy statistic of building roof elevation from the DSM based methods, and the distribution of absolute error (AE) of ENVI and INPHO is presented in Figure 18. The results show that ENVI achieves better accuracy than INPHO for building roof height estimation in the two datasets. For both ENVI and INPHO, the estimation accuracy of low-rise buildings is better than that of high-rise buildings. As was expected, our method achieves the highest precision. In a conclusion, INHPO and ENVI have limitations in roof estimation of high-rise buildings when applied to GF-7 images, and the performance of our method is very stable for both high and low buildings. Figure 19 shows the 3D reconstruction models of the proposed solution in this dataset.

Experimental Results of Xi'an Dataset
In the dataset of Xi'an, there are a total number of 40 buildings, in which 7 buildings are with podium. Similar to the procedure presented in Section 3.3.2, the proposed method is compared with DSM based results of ENVI and INPHO. Table 7 shows the accuracy statistic for this dataset. Figure 20 shows the correlation the estimated building heights and their reference values, in which the dots nearer the dashed line have the higher estimation precision. From the results listed in Table 7, we can conclude that our method achieves the best performance, and obviously improves the accuracy of building height estimation; its MAE and RMSE of building height estimation is 1.69 m and 2.34 m, respectively. By further observation from Figure 20, we can see that within the range of 0-100 m, ENVI can achieve comparable accuracy when compared with the proposed method. However, with the increase of building height, its performance dramatically decrease. It can be seen from the green dots, which are nearly on the dashed line, our method archives consistently high precision within the entire building height range.  In the Xi'an dataset, the building heights of the 7 podium buildings are evaluated, in which more than one roof elevations are extracted using our iterative roof matching method. For performance evaluation, the height of each podium building is divided into three parts, and their heights are measured manually from GF-7 stereo image as the reference values. Table 8 shows the statistical results of building height estimation, which lists the value of the first estimated building height by using the term Whole and height for each part after the podium segmentation by using the terms Part1, Part2 and Part3. Besides, the empty values in Table 8 represent that the part of the second podium is completely covered by other parts, and the ground truth building height cannot be estimated, as shown in Figure 21. The results show that for each part of the building with podium, the proposed method by using podium segmentation can find several elevation planes in a building contour, and can also accurately estimate their height. By the further analysis, we conclude that the main error of height estimation comes from the matching error caused by partial or entire occlusion of building roofs. Figure 22 shows the 3D reconstruction models of the proposed solution in this dataset.

Conclusions
This study proposes a method for extracting building height information from GF-7 satellite stereo images. First, an object-oriented roof matching algorithm is proposed based on building contour to extract accurate building roof elevation from GF-7 stereo image, and DSM generated by business software is then used to obtain building bottom elevation. Second, to cope with buildings with multiple level height plane, a mask gray difference metric is proposed to search multiple elevation planes and segment the building contour. Finally, by using ground truth data from LiDAR point clouds or manually measured building height, the proposed solution is extensively evaluated and compared with shadowbased method and DSM based method. The experimental results demonstrate that the proposed solution can achieve the best performance and could be an useful solution for accurate and automatic building height information extraction from GF-7 satellite stereo images.
Compared with the DSM and shadow-based method, our algorithm skillfully solves the problem of height estimation of high-rise buildings. Compared with deep learning, this algorithm does not need training data set. For the first time, we focused on the problem of multiple elevation planes within the height of a building.