A Multi-View Dense Image Matching Method for High-Resolution Aerial Imagery Based on a Graph Network

: Multi-view dense matching is a crucial process in automatic 3D reconstruction and mapping applications. In this paper, we present a robust and effective multi-view dense matching algorithm for high-resolution aerial images based on a graph network. The overlap ratio and intersection angle between image pairs are used to ﬁnd candidate stereo pairs and build the graph network. A Coarse-to-Fine strategy based on an improved Semi-Global Matching algorithm is applied for disparity computation across stereo pairs. Based on the constructed graph, point clouds of base views are generated by triangulating all connected image nodes, followed by a fusion process with the average reprojection error as a priority measure. The proposed method was successfully applied in experiments on aerial image test dataset provided by the ISPRS of Vaihingen, Germany and an oblique nadir image block of Zürich, Switzerland, using three kinds of matching conﬁgurations. The proposed method was compared to other state-of-art methods, SURE and PhotoScan. The results demonstrate that the proposed method delivers matches at higher completeness, efﬁciency, and accuracy than the other methods tested; the RMS for average reprojection error reached the sub pixel level and the actual positioning deviation was better than 1.5 GSD.


Introduction
With the improvement of camera technologies and the rise of new matching approaches, more and more image-based software for 3D modeling and reconstruction have been developed, such as Street Factory, Accute3D, PhotoScan, in which a multi-view dense image matching (DIM) algorithm is one of the most crucial processes.The goal of multi-view dense matching is to extract dense point clouds from multiple images that have known orientated parameters.Image-based surveying and 3D modeling can now deliver point clouds with accuracy comparable to those produced by laser scanning [1] for many terrestrial and aerial applications in a reasonable time.Since the inherent nature of multi-spectral images, rich texture information can be accessed.Moreover, the absolute accuracy of a point cloud can be assessed based on the redundant measurements extracted from imagery.Image-based 3D reconstruction is widely used for 3D modeling, mapping, robotics, and navigation because it is lightweight, convenient, cost effective, and generates textured point clouds comparable to those produced by LiDAR systems.
Dense image matching algorithms can be divided into stereo pair based algorithms [2-5] and multi-view based algorithms [6][7][8][9][10][11][12] according to type of processing units.They can also be divided into local and global algorithms according to different optimization strategies [13].Local (window-based) methods compute disparity at a given point using intensity values with implicit smoothing assumptions, while global methods make explicit smoothness assumptions and solve for a global optimization problem using an energy minimization approach, based on regularized Markov fields (MRFs), graph-cut, dynamic programming, or max-flow methods [1].The global methods often yield better performance than local methods, but with a more complex algorithm.Semi-global matching (SGM) algorithms [2,3,14] use pixelwise, Mutual Information (MI)-based matching supported by a smoothness constraint expressed as a global cost function, and performs a fast approximation by pathwise optimizations from all directions, thus representing a compromise between performance and complexity.An extensive study of different matching costs [15] showed that a census matching cost is the most robust technique for stereo vision, and performs better for matching local radiometric changes than the MI cost in many real-world applications.
The four state-of-art software; SURE, MicMac, PMVS, and PhotoScan were evaluated in terms of similarities and differences in approach [1].SURE [5,16] is based on SGM method followed by a fusion step in which redundant depth estimations across single stereo models are merged.Micmac [7,17] implements a coarse-to-fine extension of the maximum-flow image-matching algorithm across multi-resolution and multi-image presented in [18].PMVS [10] is a region growing method that expands a sparse set of matched key points to nearby pixel correspondences before using visibility constraints to filter out false matches.PhotoScan uses a stereo SGM-like method, a proprietary commercial package.Experiments showed that the SURE and PhotoScan approaches that adopted a SGM or SGM-like method are more accurate and efficient than the PMVS and Micmac approaches.
The stereo DIM algorithm in our implementation is based on the tSGM (SGM with a tube shaped disparity range) algorithm in SURE; it is an improvement on the SGM algorithm in memory demand and processing time.The tSGM algorithm provides more a complete reconstruction of lightly textured objects and objects with a repetitive texture.However, edges are not reconstructed as clearly as in SGM algorithm, particularly in the case of close range photogrammetry.For this reason, a guided median filter was introduced to preserve depth discontinuities instead of the median filter deployed in the current method.Within MVS methods in image space, disparity estimations for single stereo models are typically merged in order to increase the reliability and precision of the final depth maps or point clouds.The original SGM algorithm [3] proposes an orthographic 2.5D projection method to fuse several disparity images from different viewpoints for aerial applications; this method is not applicable for close range and oblique aerial applications because too much information would be lost.SURE [5] presents a method to transform all disparity maps of a base image to the distance maps between the camera center and object points on the optical ray, then fuses them by minimizing the reprojection error.However, the point clouds generated from different base images result in small offsets since they are triangulated separately using different set of images.In our approach, a multi-view DIM point cloud generation method based on a graph network is introduced, integrating a correspondence linking technique as presented in [19].The main steps of the current method are as follows: First, a graph network is built considering the overlap relationship and intersection angle between adjacent images.Second, epipolar images are generated and a modified tSGM algorithm is applied to obtain dual direction disparity maps across stereo pairs.A multi-view triangulation approach based on the graph network is adopted to calculate a dense point cloud for a single base view.Finally, multi-view dense point clouds are fused with average reprojection error as a priority measure to generate a final point cloud.
The main contribution of this paper are: (1) Proposing a graph network based multi-view DIM method that computes fewer stereo pairs than traditional MVS methods like SURE and PhotoScan, while obtaining better completeness and higher accuracy.The triangulated point clouds from different base images in the proposed method share the same correspondences with no offset and can be seamlessly integrated; (2) We introduce a guided median filter into the tSGM algorithm to preserve depth discontinuities for better performance in difficult areas with sharp discontinuities, weak textures, or repeated textures.This paper is organized as follows: Section 2 describes the graph network based multi-view DIM algorithm, where Section 2.1 describes the construction of graph network.Section 2.2 presents the improved tSGM algorithm for stereo dense matching.Section 2.3 shows the graph network based triangulation method of base view.Section 2.4 introduces the multi-view point-cloud fusion approach.Section 3 presents experiments and results, followed by the discussion in Section 4. Section 5 draws some conclusions.

Graph Network Based DIM Methodology
In this section, an effective graph network based approach for multi-view dense matching of high-resolution imagery is presented.The proposed method can be divided into four main steps.In the first step, candidate stereo pairs are selected considering overlap relationships and intersection angle to construct the multi-view graph network.In the second step, epipolar images are generated followed by processing with the modified tSGM algorithm to calculate dual-direction disparity maps.In the third step, a graph network based triangulation is adopted to generate a dense point cloud of base views from all the connected images, including those outside the candidate matches.Redundancy is exploited through the correspondence linking technique to eliminate error and increase the accuracy of triangulation, while enabling assessment of the accuracy of the average reprojection error.Finally, the multi-view point clouds are fused with the average reprojection error as a priority measure.Details of those specific principles and implementation processes are provided in the subsequent sub sections.triangulation method of base view.Section 2.4 introduces the multi-view point-cloud fusion approach.Section 3 presents experiments and results, followed by the discussion in Section 4. Section 5 draws some conclusions.

Graph Network Based DIM Methodology
In this section, an effective graph network based approach for multi-view dense matching of high-resolution imagery is presented.The proposed method can be divided into four main steps.In the first step, candidate stereo pairs are selected considering overlap relationships and intersection angle to construct the multi-view graph network.In the second step, epipolar images are generated followed by processing with the modified tSGM algorithm to calculate dual-direction disparity maps.In the third step, a graph network based triangulation is adopted to generate a dense point cloud of base views from all the connected images, including those outside the candidate matches.Redundancy is exploited through the correspondence linking technique to eliminate error and increase the accuracy of triangulation, while enabling assessment of the accuracy of the average reprojection error.Finally, the multi-view point clouds are fused with the average reprojection error as a priority measure.Details of those specific principles and implementation processes are provided in the subsequent sub sections.Overlap ratio and intersection angle are then used to determine the candidate stereo pairs.The overlap ratio can be defined in two ways as illustrated in Figure 2.

Graph Network Construction
ε=S o /min(S1,S2) ε=min(S o1 /S1,S o2 /S2)  Overlap ratio and intersection angle are then used to determine the candidate stereo pairs.The overlap ratio can be defined in two ways as illustrated in Figure 2.
In an aerial situation, the corners of image frame are projected to a common plane; then the overlap ratio can be calculated by ε = S o /min(S 1 , S 2 ) where S o , S 1 , S 2 represent the area and overlap area respectively.In more general cases, the matched features obtained after a SFM (Structure from Motion) or BA (Bundle Adjustment) process are used to define the overlap ratio as ε = min(S o1 /S 1 , S o2 /S 2 ) where S o1 , S o2 represent the minimum bounding box of matched features.When the matches are not available, coarse-image matching can be executed with low-resolution images for fast processing.Overlap ratio and intersection angle are then used to determine the candidate stereo pairs.The overlap ratio can be defined in two ways as illustrated in Figure 2.
ε=S o /min(S1,S2) ε=min(S o1 /S1,S o2 /S2)  The intersection angle can be calculated by the two optical axis vectors of the cameras, and only those with an intersection angle less than a threshold (ten degrees is the default) are considered.More occlusion appears between images with a larger intersection angle; adding difficulty to dense image matching and reducing the quality of matching results.Stereo pairs outside of target graph G however, can be connected by certain paths through image pairs with high overlap, thus making full use of the redundant measurements and therefore improving matching accuracy.

Dense Stereo Matching
In this section, the implemented module for dense stereo matching is described.In Section 2.2.1, the details of epipolar rectification are presented based on the algorithms in [20][21][22][23].In Section 2.2.2, a modified hierarchical strategy based on tSGM algorithm is proposed by introducing the guided median filter that efficiently preserves local structure.

Epipolar Rectification
Epipolar rectification must be done before stereo dense matching, which reduces the disparity search space to one dimension.Suppose the camera intrinsic matrix is K, the extrinsic parameters are R l , T l and R r , T r ; and the relative transformation is T = T r − R•T l where R = R r •R −1 l ; then, the rotation matrix R rect for rectification between an original image and epipolar image can be calculated as follows: The new rotation matrix for the rectified image is: Thus, the homographic matrix between the original image and epipolar image is: After deriving the homographies H l , H r , the gray values for pixels at integer positions in the rectified frames are calculated by interpolation in Equation (5) using the correspondence coordinates.

Modifications of the tSGM Algorithm
The SGM algorithm is often the technique of choice in real world applications as it generates dense reconstruction results with highly robust parameterization and capability for real-time processing.This algorithm however, creates extensive memory demands given the large format frames required and the high number of potential correspondences.
The tSGM algorithm in SURE provides a hierarchical coarse-to-fine solution for the SGM method to limit disparity search ranges and decreases the memory demand as well as processing time.The key advantage of tSGM algorithm is that it makes full use of disparity images matched at low-resolution pyramids to limit the disparity search range at next level pyramids, which dramatically decrease the memory demands and processing times.However, edges are not reconstructed as clearly as in SGM algorithm.
Inspired by the joint bilateral filter [24] and guided filter [25], which have the edge-preserving smoothing property, a guided median filter [26] was introduced to replace the traditional median filter in this paper, with left epipolar image as the guidance image and the output of a pixel is a weighted average of nearby pixels in the guidance image.The pixels are weighted in the local histograms: where N (x) is a local window near x, V is the pixel value, i is the discrete bin index, δ (•) is the Kronecker delta function: δ (•) is 1 when the argument is 0, and is 0 otherwise, and w is the bilateral weight that suppresses the pixels with different color from the center pixel.It is straightforward to pick the median value through accumulating this histogram.Comparing the stereo matching results by guided median filter and traditional median filter as shown in Figure 3, the disparities in areas with sharp discontinuities obtained better results due to the introduction of the guided median filter.
Remote Sens. 2016, 8, 799 5 of 18 The tSGM algorithm in SURE provides a hierarchical coarse-to-fine solution for the SGM method to limit disparity search ranges and decreases the memory demand as well as processing time.The key advantage of tSGM algorithm is that it makes full use of disparity images matched at lowresolution pyramids to limit the disparity search range at next level pyramids, which dramatically decrease the memory demands and processing times.However, edges are not reconstructed as clearly as in SGM algorithm.
Inspired by the joint bilateral filter [24] and guided filter [25], which have the edge-preserving smoothing property, a guided median filter [26] was introduced to replace the traditional median filter in this paper, with left epipolar image as the guidance image and the output of a pixel is a weighted average of nearby pixels in the guidance image.The pixels are weighted in the local histograms: where is a local window near , is the pixel value, is the discrete bin index, • is the Kronecker delta function: • is 1 when the argument is 0, and is 0 otherwise, and w is the bilateral weight that suppresses the pixels with different color from the center pixel.It is straightforward to pick the median value through accumulating this histogram.Comparing the stereo matching results by guided median filter and traditional median filter as shown in Figure 3, the disparities in areas with sharp discontinuities obtained better results due to the introduction of the guided median filter.The hierarchical approach in our implementation is a little different from tSGM algorithm.One advantage of guided median filter in the tSGM algorithm is that a canny edge detector is no longer necessary to optimize the penalty parameter P2, thus increasing the efficiency of the algorithm.When matching an image pair fin a certain pyramid level , the disparity between base and match images is computed followed by a consistency check to obtain valid disparity pixels with high confidence.The guided median filter is adopted to generate a median disparity image.If a pixel is valid, a small 15 × 15 window is searched; else, a larger 31 × 31 window is used to obtain the median value of all valid disparities.After these steps, a speckle-filtering algorithm in OpenCV library [27] is exploited to extract the regions of noise or holes followed by interpolation, again using a guided median filter.The maximum and minimum disparity images , are calculated from the final median disparity image with the search ranges for valid and invalid pixels by 16 and 32 according to where is the prior maximum disparity of the image pair and r is the half size of search ranges, then path accumulation keeps the same with tSGM algorithm.Considering the direction of edges in The hierarchical approach in our implementation is a little different from tSGM algorithm.One advantage of guided median filter in the tSGM algorithm is that a canny edge detector is no longer necessary to optimize the penalty parameter P2, thus increasing the efficiency of the algorithm.When matching an image pair fin a certain pyramid level l, the disparity between base and match images is computed followed by a consistency check to obtain valid disparity pixels with high confidence.The guided median filter is adopted to generate a median disparity image.If a pixel is valid, a small 15 × 15 window is searched; else, a larger 31 × 31 window is used to obtain the median value of all valid disparities.After these steps, a speckle-filtering algorithm in OpenCV library [27] is exploited to extract the regions of noise or holes followed by interpolation, again using a guided median filter.The maximum and minimum disparity images R l max , R l min are calculated from the final median disparity image D l with the search ranges for valid and invalid pixels by 16 and 32 according to where D max is the prior maximum disparity of the image pair and r is the half size of search ranges, then path accumulation keeps the same with tSGM algorithm.Considering the direction of edges in graph network, the depth maps of the reference image and matched image are both computed and output for subsequent processing, as shown in Figure 4.
Remote Sens. 2016, 8, 799 6 of 18 graph network, the depth maps of the reference image and matched image are both computed and output for subsequent processing, as shown in Figure 4.

Graph Network Based Triangulation
The geometric correspondence relationship across multi-view images based on graph network is shown in Figure 5.For a pixel on image , the correspondence on image can be calculated as follows: First its correspondence coordinate on left epipolar image is calculated using the homographic matrix .Then, the matched pixel on right epipolar image is easily obtained from the corresponding disparity of base image.Finally, the correspondence pixel is determined by multiplying the inverted homographic matrix .The whole transformation is as follows:

Graph Network Based Triangulation
The geometric correspondence relationship across multi-view images based on graph network is shown in Figure 5.

Graph Network Based Triangulation
The geometric correspondence relationship across multi-view images based on graph network is shown in Figure 5.For a pixel on image , the correspondence on image can be calculated as follows: First its correspondence coordinate on left epipolar image is calculated using the homographic matrix .Then, the matched pixel on right epipolar image is easily obtained from the corresponding disparity of base image.Finally, the correspondence pixel is determined by multiplying the inverted homographic matrix .The whole transformation is as follows: For a pixel x i on image I i , the correspondence x j on image I j can be calculated as follows: First its correspondence coordinate on left epipolar image is calculated using the homographic matrix H m−i .Then, the matched pixel on right epipolar image is easily obtained from the corresponding disparity of base image.Finally, the correspondence pixel x j is determined by multiplying the inverted homographic matrix H m−j .The whole transformation is as follows: Since the transformed coordinate is not integer, bilinear interpolation will be exploited to calculate the value on the disparity image.Moreover, the disparity image of a base image or matched image has to be chosen according to the direction of arcs in graph network.For a path with length of d from image I i to image I n , the correspondence on image I n can be computed along the path by applying the transformation in Equation ( 8), for d times.To reconstruct the dense results for a reference image, a full graph search is utilized to find the candidate images for triangulation.However, only those that meet the minimum overlap conditions (10% as the default) and intersection angle (10 degrees as the default) are maintained.After all the matches across the multi-view images in the graph network are computed, the corresponding object point X i can be calculated using multi-view triangulation according to where P i represents the camera projection matrix of image I i , and the average reprojection error can be calculated by Despite the use of consistency checking, speckle removal, and guided median filtering, there is still some error.By setting the minimum number of image matches and the threshold for reprojection error during triangulation, potential error can be efficiently eliminated, which greatly reduces the complexity and runtime in the subsequent filtering process.Compared to direct triangulation from stereo pairs, graph network based triangulation makes full use of redundant measurements, resulting in higher accuracy for reconstructed dense point clouds.Due to the use of graph network, the triangulation of the same object point from different base images will get the same results.The dense point clouds extracted from different images therefore, will have a consistent coordinate system and can be more easily fused.
Benefiting from the transitivity of graph network, there is no need to compute stereo matching for image pairs with low overlap directly, resulting in different matching configurations for the proposed method.Three different matching configurations were designed and tested.

Multi-View Point Cloud Fusion
Considering the influence of perspective, occlusion, and mismatching, point clouds generated from different views need be fused and filtered so that the scenario results are complete.Since there is a high density of points in overlapping areas, resampling is applied to generate a final point cloud.The main steps of point cloud fusion in the current implementation are as follows: (1) Using noise filtering algorithm in Point Cloud Library (PCL) [28] to remove the remaining mismatches of every single point clouds.Since most mismatches are excluded during the multi-view consistency checking process, the number of those to be eliminated is small; (2) Calculating the minimum bounding box of the target survey areas and partition task units by a tiling mechanism to make the task parallelization and accelerate the progress; (3) Within each task unit, point clouds that intersect with a bounding box are fused and resampled, using the K-d tree algorithm.The average reprojection error is the priority measure.

Description of the Test Dataset and Experiments
To verify the effectiveness of the graph network-based multi-view DIM method proposed in this paper, in our experiments two public image datasets in an ISPRS test project were used as shown in Figure 6.

Description of the Test Dataset and Experiments
To verify the effectiveness of the graph network-based multi-view DIM method proposed in this paper, in our experiments two public image datasets in an ISPRS test project were used as shown in Figure 6.The first dataset was captured over the city of Zürich with a Leica RCD30 Oblique Penta medium format camera with a GSD (Ground Sample Distance) of 6 cm.The test area to be processed is depicted in Figure 6 by a red rectangle overlaid on an ortho image.In these experiments, only the nadir images of an oblique image block were used.The second dataset was captured over Vaihingen in Germany; it is a part of an Intergraph/ZI DMC block at 8 cm ground resolution.This dataset was collected with a Leica ALS50 system with the average point density of about 4 points/m 2 .This was used as reference data to evaluate actual positioning accuracy.Descriptive information for the two datasets is shown in Table 1.The proposed algorithm was implemented by VC++ 2010 and integrated to our Photogrammetric Package; Mogas, which adopted a Census and MI matching cost for computing the disparity volume.In our experiments, for better performance, Census cost is used to calculate the disparity space image.The related algorithm parameters are set as follows: P1, P2 are set to 30 and 150 in SGM algorithm, the radius of guided median window is set to 19, and the speckle size is set to 256, corresponding to original resolution.All the parameter settings were kept the same for both Zürich and Vaihingen dataset.To reduce memory usage and processing time, the images are down-sampled by half before the dense matching process.The experiments were conducted on a Dell Precision T7600 workstation with a 64-bit Window 7 operation system, a sixteen Intel Xeon E5-2650 M CPU, 2.0 GHz and 64 GB memory.The first dataset was captured over the city of Zürich with a Leica RCD30 Oblique Penta medium format camera with a GSD (Ground Sample Distance) of 6 cm.The test area to be processed is depicted in Figure 6 by a red rectangle overlaid on an ortho image.In these experiments, only the nadir images of an oblique image block were used.The second dataset was captured over Vaihingen in Germany; it is a part of an Intergraph/ZI DMC block at 8 cm ground resolution.This dataset was collected with a Leica ALS50 system with the average point density of about 4 points/m 2 .This was used as reference data to evaluate actual positioning accuracy.Descriptive information for the two datasets is shown in Table 1.The proposed algorithm was implemented by VC++ 2010 and integrated to our Photogrammetric Package; Mogas, which adopted a Census and MI matching cost for computing the disparity volume.In our experiments, for better performance, Census cost is used to calculate the disparity space image.The related algorithm parameters are set as follows: P1, P2 are set to 30 and 150 in SGM algorithm, the radius of guided median window is set to 19, and the speckle size is set to 256, corresponding to original resolution.All the parameter settings were kept the same for both Zürich and Vaihingen dataset.To reduce memory usage and processing time, the images are down-sampled by half before the dense matching process.The experiments were conducted on a Dell Precision T7600 workstation with a 64-bit Window 7 operation system, a sixteen Intel Xeon E5-2650 M CPU, 2.0 GHz and 64 GB memory.
To explore the most efficient approach with the highest accuracy for multi-view dense matching based on a graph network, three kinds of matching configurations were designed and tested, as illustrated in Figure 7.

Remote Sens. 2016, 8, 799 9 of 18
To explore the most efficient approach with the highest accuracy for multi-view dense matching based on a graph network, three kinds of matching configurations were designed and tested, as illustrated in Figure 7.

Multi-View DIM Results of the Proposed Method Using the Zürich Dataset
To validate the effectiveness and geometric accuracy of the proposed algorithm, the dense point cloud of image-954 of the Zürich dataset is shown together with its height map and accuracy map produced using first configuration, generated according to the average reprojection error during the triangulation process, shown in Figure 8. Figure 8 shows that the proposed algorithm is effective for multi-view dense matching with high completeness and precision; the RMSE of average projection error was around 0.60 pixel.Furthermore, it can be seen that most of the points with large error are located on the edge of tall buildings, areas of vegetation, and occluded areas as seen in the accuracy map.This result was consistent with the results reported in the literature [13,29].For these areas with large error, dense image matching often

Multi-View DIM Results of the Proposed Method Using the Zürich Dataset
To validate the effectiveness and geometric accuracy of the proposed algorithm, the dense point cloud of image-954 of the Zürich dataset is shown together with its height map and accuracy map produced using first configuration, generated according to the average reprojection error during the triangulation process, shown in Figure 8.

Remote Sens. 2016, 8, 799 9 of 18
To explore the most efficient approach with the highest accuracy for multi-view dense matching based on a graph network, three kinds of matching configurations were designed and tested, as illustrated in Figure 7.

Multi-View DIM Results of the Proposed Method Using the Zürich Dataset
To validate the effectiveness and geometric accuracy of the proposed algorithm, the dense point cloud of image-954 of the Zürich dataset is shown together with its height map and accuracy map produced using first configuration, generated according to the average reprojection error during the triangulation process, shown in Figure 8. Figure 8 shows that the proposed algorithm is effective for multi-view dense matching with high completeness and precision; the RMSE of average projection error was around 0.60 pixel.Furthermore, it can be seen that most of the points with large error are located on the edge of tall buildings, areas of vegetation, and occluded areas as seen in the accuracy map.This result was consistent with the results reported in the literature [13,29].For these areas with large error, dense image matching often Figure 8 shows that the proposed algorithm is effective for multi-view dense matching with high completeness and precision; the RMSE of average projection error was around 0.60 pixel.Furthermore, it can be seen that most of the points with large error are located on the edge of tall buildings, areas of vegetation, and occluded areas as seen in the accuracy map.This result was consistent with the results reported in the literature [13,29].For these areas with large error, dense image matching often produces results with incorrect disparity or without disparity because there are no correspondences or because there is a great deal of noise, which results in lower confidence.These sources of error will be eliminated during the triangulation process.
To evaluate the efficiency and accuracy of different matching configurations, the number of stereo pairs and points (point cloud of image-954), the processing time, and the RMSE of the average reprojection error were used for comparison; these results are shown in Table 2. Table 2 shows that integration of stereo pairs with lower overlap; whether in flight or across flights during MVS processing for the current algorithm, has little effect on the number of reconstructed points.The accuracy slightly decreased when the number of stereo pairs Typically, in aerial photogrammetry, a larger base-height ratio will yield higher intersection accuracy.In a dense matching situation, increasing the base-height ratio also means more occluded areas between stereo pairs, and significantly influences the quality of the SGM dense matching results.Points in occluded areas often produce matching results with more noise, which leads to low accuracy during triangulation process.Meanwhile, correspondence in image pairs outside the candidates selected during the construction stage of the graph network can be matched using the connections in the graph network, resulting in more observations for triangulation and therefore, higher accuracy.
Furthermore, when compared with the third configuration adopted in SURE and PhotoScan, the total processing time was reduced by 57% using the first configuration, as recommended in this paper.The number of images however was small and the overlap relationships were simple in the test dataset, nevertheless the total runtime can be reduced by multiples when applying the proposed algorithm to scenes with larger amount of images.Increasing the in-flight overlap with the minimum required overlap across flights for aerial photogrammetry balances between reconstruction accuracy and efficiency of field operations.
According to the method presented in Section 2.4, the multi-view point clouds within the test area were fused and resampled with the average reprojection error as a priority measure, the results are shown in Figure 9.
Figure 9 shows that the presented strategy of multi-view point cloud fusion is effective, with the RMSE of average projection error of the whole target area at around 0.55 pixel, which is superior to the single view accuracy.The proportion of points with large error was reduced when compared to single view, as evident in the accuracy map from the histogram of reprojection error.During the fusion process, points in the neighborhood with the minimum average reprojection error are reserved as resampled points, which can reduce the difficulty and complexity of post-processing and further improving the accuracy of the final point cloud.This approach can be easily extended to oblique photogrammetry, where the dense point clouds from different directions must also be fused.

Robustness for Matching in Difficult Areas
Regions with sharp discontinuities, weak textures or repeated textures like tall buildings, and areas of vegetation, are considered difficult areas in dense matching [13,29].Complete and accurate matching results for those kinds of regions can be obtained using the proposed algorithm as illustrated in Figure 10.Due to the introduction of guided median filter in tSGM algorithm, sharp edge characteristics of the target can be maintained, which ensures accurate dense matching results at sharp discontinuities.Reasonable disparities in areas with weak textures or repeated textures can be computed according to the surrounding valid pixels by guided median filtering and interpolation.

Robustness for Matching in Difficult Areas
Regions with sharp discontinuities, weak textures or repeated textures like tall buildings, and areas of vegetation, are considered difficult areas in dense matching [13,29].Complete and accurate matching results for those kinds of regions can be obtained using the proposed algorithm as illustrated in Figure 10.Due to the introduction of guided median filter in tSGM algorithm, sharp edge characteristics of the target can be maintained, which ensures accurate dense matching results at sharp discontinuities.Reasonable disparities in areas with weak textures or repeated textures can be computed according to the surrounding valid pixels by guided median filtering and interpolation.

Robustness for Matching in Difficult Areas
Regions with sharp discontinuities, weak textures or repeated textures like tall buildings, and areas of vegetation, are considered difficult areas in dense matching [13,29].Complete and accurate matching results for those kinds of regions can be obtained using the proposed algorithm as illustrated in Figure 10.Due to the introduction of guided median filter in tSGM algorithm, sharp edge characteristics of the target can be maintained, which ensures accurate dense matching results at sharp discontinuities.Reasonable disparities in areas with weak textures or repeated textures can be computed according to the surrounding valid pixels by guided median filtering and interpolation.

Accuracy Analysis with Reference ALS Data
Since the actual positioning accuracy is affected by both matching accuracy and base-height ratio, reprojection error cannot fully evaluate the result of dense matching, so the Vaihingen dataset with reference ALS point cloud was used to validate the RMSE for positioning accuracy in the target area with vegetation and buildings.The dense matching results produced by the tested three configurations are compared to the reference data.Deviation maps were generated, as shown in Figure 11. Figure 11 shows that the dense matching results from the proposed method were consistent with the reference ALS point cloud.The actual positioning accuracy was 1.50 GSD, 1.75 GSD and 1.87 GSD for three configurations, respectively; similar to the accuracy as determined using reprojection error

Accuracy Analysis with Reference ALS Data
Since the actual positioning accuracy is affected by both matching accuracy and base-height ratio, reprojection error cannot fully evaluate the result of dense matching, so the Vaihingen dataset with reference ALS point cloud was used to validate the RMSE for positioning accuracy in the target area with vegetation and buildings.The dense matching results produced by the tested three configurations are compared to the reference data.Deviation maps were generated, as shown in Figure 11.

Accuracy Analysis with Reference ALS Data
Since the actual positioning accuracy is affected by both matching accuracy and base-height ratio, reprojection error cannot fully evaluate the result of dense matching, so the Vaihingen dataset with reference ALS point cloud was used to validate the RMSE for positioning accuracy in the target area with vegetation and buildings.The dense matching results produced by the tested three configurations are compared to the reference data.Deviation maps were generated, as shown in Figure 11. Figure 11 shows that the dense matching results from the proposed method were consistent with the reference ALS point cloud.The actual positioning accuracy was 1.50 GSD, 1.75 GSD and 1.87 GSD for three configurations, respectively; similar to the accuracy as determined using reprojection error Figure 11 shows that the dense matching results from the proposed method were consistent with the reference ALS point cloud.The actual positioning accuracy was 1.50 GSD, 1.75 GSD and 1.87 GSD for three configurations, respectively; similar to the accuracy as determined using reprojection error as discussed in Section 3.2.1.Comparing the positioning error maps of the three configurations, configuration 2 was less accurate than configuration 1 due to the impact of the low matching accuracy of wide-baseline image pairs using the same image collection for triangulation.On the other hand, image pairs across flights were added in configuration 3 for triangulation, which was more accurate in some local areas than configuration 1 as shown in the area denoted by a yellow ellipse.However, the matching accuracy in the areas with sharp discontinuities like tall buildings was further reduced due to the big intersection angle between image pairs across flights, this resulted in less accurate results than configuration 1 , which considered completeness and overall accuracy.

Comparison with SURE and PhotoScan
Since our method is close to SURE, the point clouds obtained the two methods were first compared through visual inspection.Figure 12 (red ellipse areas) displays the results for the Zürich dataset.Ghosting effects are visible in the SURE results, while the results produced using our proposed method show no evidence of Ghosting.In the SURE method, point clouds for the same scene are generated from different views during the triangulation process with different correspondences.This results in a slight offset across the multi-view point clouds.In contrast, since a graph network is used in our algorithm, the point clouds from different views share the same correspondences with no offset and can be seamlessly integrated.
Remote Sens. 2016, 8, 799 13 of 18 as discussed in Section 3.2.1.Comparing the positioning error maps of the three configurations, configuration ② was less accurate than configuration ① due to the impact of the low matching accuracy of wide-baseline image pairs using the same image collection for triangulation.On the other hand, image pairs across flights were added in configuration ③ for triangulation, which was more accurate in some local areas than configuration ① as shown in the area denoted by a yellow ellipse.However, the matching accuracy in the areas with sharp discontinuities like tall buildings was further reduced due to the big intersection angle between image pairs across flights, this resulted in less accurate results than configuration ①, which considered completeness and overall accuracy.

Comparison with SURE and PhotoScan
Since our method is close to SURE, the point clouds obtained by the two methods were first compared through visual inspection.Figure 12 (red ellipse areas) displays the results for the Zürich dataset.Ghosting effects are visible in the SURE results, while the results produced using our proposed method show no evidence of Ghosting.In the SURE method, point clouds for the same scene are generated from different views during the triangulation process with different correspondences.This results in a slight offset across the multi-view point clouds.In contrast, since a graph network is used in our algorithm, the point clouds from different views share the same correspondences with no offset and can be seamlessly integrated.To further validate the performance and accuracy of the proposed method, two state-of-the-art software product (SURE and PhotoScan) results are compared at three target areas from easy to challenging in the Vaihingen dataset.Target-1 is a planar playground, target-2 is a single building and target-3 is a continuous complex building.Affected by the plant growth due to time gap between image capture and LiDAR flights, and given differences in measurement principles, the reference ALS data cannot exclusively match the DIM results.Areas of vegetation therefore, were not used in these experiments.
To make a fair comparison, the three methods are excuted at the same image resolution using configuration three, which is adopted as the solution in SURE and PhotoScan.In order to reduce the requirements for professional skills, SURE and PhotoScan are designed ina one-button processing mode, so the matching results are generated with default settings.In this paper, point clouds obtained from the three methods were first resampled with the grid width of the GSD for the respective imagery to get the same density, so points with the median elevation were retained when the neighborhood of a target grid point contained matching results.Then, deviation maps were generated by computing To further validate the performance and accuracy of the proposed method, two state-of-the-art software product (SURE and PhotoScan) results are compared at three target areas from easy to challenging in the Vaihingen dataset.Target-1 is a planar playground, target-2 is a single building and target-3 is a continuous complex building.Affected by the plant growth due to time gap between image capture and LiDAR flights, and given differences in measurement principles, the reference ALS data cannot exclusively match the DIM results.Areas of vegetation therefore, were not used in these experiments.
To make a fair comparison, the three methods are excuted at the same image resolution using configuration three, which is adopted as the solution in SURE and PhotoScan.In order to reduce the requirements for professional skills, SURE and PhotoScan are designed ina one-button processing mode, so the matching results are generated with default settings.In this paper, point clouds obtained from the three methods were first resampled with the grid width of the GSD for the respective imagery to get the same density, so points with the median elevation were retained when the neighborhood of a target grid point contained matching results.Then, deviation maps were generated by computing the euclidean distance using reference data.The RMSE and mean error were also calculated along with the correctness and completeness rates that represent the percentage of pixels with an error <n × GSD, as shown in Figures 13 and 14.Correctness was computed with the value of three GSD, completeness with ten GSD. the euclidean distance using reference data.The RMSE and mean error were also calculated along with the correctness and completeness rates that represent the percentage of pixels with an error <n × GSD, as shown in Figures 13 and 14.Correctness was computed with the value of three GSD, completeness with ten GSD.To compare the performance of three methods in flat areas, target-1 was selected for the simple case, as shown in the top row of Figure 13.These results show that the proposed method is comparable to PhotoScan, but better than SURE, considering the RMSE and mean error.SURE has a large deviation in the contour region, #1 for a runway, while areas with large deviations in the results from the proposed method are mainly located in areas of cast shadows, #2.Because the image intensity values in those regions are inconsistent with the surrounding areas, the cencus cost and guided median filter are not viable, so the match fails.In the more difficult cases, target-2 and target-3 were selected to evaluate performance on buildings in urban areas.From the deviation maps of target-b as depicted in the middle of Figure 13, the results are similar to those obtained for target-1 by the RMSE and mean error.This structure was effectively captured by all three methods.However, it can be seen that the proposed method works best on the edge details of #3 showing a building covered by a cast shadow.The most challenging case in target-3, where buildings are of different heights with more occlusionsis presented at the bottom of Figure 13, PhotoScan no longer the delivers the best performance due to missed matches on the both ends of the buildings in region #4 and #5.In contrast to PhotoScan, the proposed method and SURE demonstrate more complete results.However, all the solutions from three methods appear relatively noisy in the outlined areas of buildings.Considering the correctness rate as dipicted on the left of Figure 14, the statistics are consistent with the previous analysis.As for the completeness rate as shown in right of Figure 14, all three methods perform well and have a completeness rate of over 95%.

Strengths and Limitations
A graph network-based multi-view dense matching algorithm is proposed in this study.The new method was successfully applied for dense point cloud generation from multi-view high-resolution aerial images.The main advantage of the proposed approach is improvement in the efficiency and completeness for stereo pair based multi-view dense matching with high accuracy.Experimental results show that the average reprojection error is at the subpixel level, and actual positioning accuracy is about 1.5 GSD.
The time consumed by dual-depth map generation for a single stereo pair in 15 Megapixel was about 2 min in our tSGM implementation.Fewer stereo pairs need to be computed in the proposed method than in the conventional practices found in the SURE and PhotoScan software, which decreased the overall processing time by 57% in our experiment with the Zürich dataset.In the selected aerial scenario, the overlap ratio was low and the number of images was small.For large scale scenarios with high overlap, the execution time could be further reduced, thereby considerably improving the efficiency and applicability of the MVS algorithm.
Within multi-view stereo methods based on image space, disparity estimations of single stereo models are typically merged in order to increase the reliability and precision of the final depth maps or point clouds.The original SGM [3] proposes an orthographic 2.5D projection method to fuse several disparity images from different viewpoints for aerial applications, and SURE [5] presents a method to transform all disparity maps of base image to the distance maps between camera center and object point on the optical ray, then fuses them by minimizing the reprojection error.However, point clouds from different base images will have small offsets since in these methods they are generated separately without connection.Those produced by the proposed method show higher consistency, therefore these point clouds can be seamlessly integrated without extra processing.
Considering that the performance of stereo matching decreases sharply with the increased base-height ratio using a SGM algorithm, image pairs with wide baselines can obtain higher correspondence using the proposed method than the direct matching result.Our experimental results show that configuration 1 has a higher performance than configuration 2 and 3 .The proposed method performs better than traditional practices found in SURE and PhotoScan since the point clouds from different base images have consistent coordinates, with competitive efficiency.
The proposed method can be combined with an object-based MVS algorithm such as the plane sweep stereo method [30,31] to achieve the optimal performance and efficiency.A minimum number of image pairs are computed to obtain initial matching results with high accuracy, so that the depth search space can be constrained in a small range for the object-based MVS method to refine the results.

Conclusions
A novel graph network based multi-view dense matching algorithm for high-resolution aerial images is proposed in this paper.In this method, a modified tSGM algorithm is first used to compute dual direction disparity maps across stereo pairs, and point clouds of single view are then generated by triangulation based on graph network, followed by multi-view point cloud fusion with average reprojection error as a priority measure.
The aerial image dataset of Vaihingen and oblique nadir image block of Zürich from the ISPRS test project are used to verify the and accuracy of the proposed method.Experimental results demonstrate the completeness, efficiency, and high precision of this algorithm as results at the sub pixel level can be achieved.To explore the most efficient approach for multi-view dense matching based on a graph network, three kinds of configurations were designed and tested.Considering the numbers of image pairs, runtime and accuracy, our comparative experiments confirm that the first configuration is superior to the other two configurations as adopted in the SURE and PhotoScan approaches.Extensive experiments show that the proposed method can derive results comparable to SURE and PhotoScan, with no ghosting and improves the accuracy of the final point cloud and can be extended to oblique photogrammetry.Furthermore, experiments verified the effectiveness of this strategy for multi-view point cloud fusion and difficult areas with sharp discontinuities, weak textures, or repeated textures can be matched by the proposed algorithm.
This research provides a good foundation for oblique multi-view dense matching; our future research will focus on further improvements to the efficiency and accuracy of the proposed algorithm.Given that random noise of final point cloud may still exist, we will attempt to exploit a post-processing process to refine the results to make it smoother.

Figure 1
Figure 1 illustrates the process for constructing the graph network.Let the target graph network be denoted by G then, nodes represent the available images, arcs represent stereo matching from a start node to end node, and paths represent correspondence linking between two images, as shown in the Figure.All the oriented images are first added to the G as single nodes respectively, while arcs are inserted to G by candidate stereo pairs.

Figure 1 Figure 1 .
Figure1illustrates the process for constructing the graph network.Let the target graph network be denoted by then, nodes represent the available images, arcs represent stereo matching from a start node to end node, and paths represent correspondence linking between two images, as shown in the Figure.All the oriented images are first added to the as single nodes respectively, while arcs are inserted to by candidate stereo pairs.

Figure 2 .
Figure 2. Two definitions of the overlap ratio: on the left, the definition in an aerial situation; and on the right the overlap ratio as defined for general purposes.

Figure 1 .
Figure 1.The schematic diagram of multi-view graph network structures.A node represents images of the scene.Arc represents stereo matching from start image i to end image j with bound disparity map D ij .Path represents correspondence linking between two images.

Figure 1 .
Figure 1.The schematic diagram of multi-view graph network structures.A node represents images of the scene.Arc represents stereo matching from start image i to end image j with bound disparity map .Path represents correspondence linking between two images.

Figure 2 .
Figure 2. Two definitions of the overlap ratio: on the left, the definition in an aerial situation; and on the right the overlap ratio as defined for general purposes.

Figure 2 .
Figure 2. Two definitions of the overlap ratio: on the left, the definition in an aerial situation; and on the right the overlap ratio as defined for general purposes.

Figure 3 .
Figure 3.The dense matching results of different processing stages for a local area (Column 1: original stereo pair, Column 2: cost aggregation results, Column 3: speckle-filtering results, Column 4: above shows the results from guided median filtering and below shows the results from traditional median filtering).

Figure 3 .
Figure 3.The dense matching results of different processing stages for a local area (Column 1: original stereo pair, Column 2: cost aggregation results, Column 3: speckle-filtering results, Column 4: above shows the results from guided median filtering and below shows the results from traditional median filtering).

Figure 4 .
Figure 4. Stereo dense matching results for epipolar images (From left to right: left epipolar, left disparity, right disparity, and right epipolar images).

Figure 5 .
Figure 5.The geometric correspondence relationship across multi-view images based on a graph network.First, a pixel on base image can be transformed to left epipolar image by homographic matrix.Then, the correspondence on the right epipolar image can be calculated using the disparity image of the base image.Finally, a pixel on a matched image can be obtained by multiplying inverted homographic matrix .

Figure 4 .
Figure 4. Stereo dense matching results for epipolar images (From left to right: left epipolar, left disparity, right disparity, and right epipolar images).
Remote Sens. 2016, 8, 799 6 of 18 graph network, the depth maps of the reference image and matched image are both computed and output for subsequent processing, as shown in Figure4.

Figure 4 .
Figure 4. Stereo dense matching results for epipolar images (From left to right: left epipolar, left disparity, right disparity, and right epipolar images).

Figure 5 .
Figure 5.The geometric correspondence relationship across multi-view images based on a graph network.First, a pixel on base image can be transformed to left epipolar image by homographic matrix.Then, the correspondence on the right epipolar image can be calculated using the disparity image of the base image.Finally, a pixel on a matched image can be obtained by multiplying inverted homographic matrix .

Figure 5 .
Figure 5.The geometric correspondence relationship across multi-view images based on a graph network.First, a pixel on base image can be transformed to left epipolar image by homographic matrix H l .Then, the correspondence on the right epipolar image can be calculated using the disparity image of the base image.Finally, a pixel on a matched image can be obtained by multiplying inverted homographic matrix H r .

Figure 6 .
Figure 6.The coverage and overlap relationships of test image datasets: (a) Zürich dataset (red rectangle for test area); (b) Vaihingen dataset.

Figure 6 .
Figure 6.The coverage and overlap relationships of test image datasets: (a) Zürich dataset (red rectangle for test area); (b) Vaihingen dataset.

1 2 3 VFigure 7 .
Figure 7. Three kinds of matching configurations for multi-view dense matching based on a graph network: ① considers only adjacent image pairs in flight ② considers all possible image pairs with overlap, in flight ③ considers overlapping image pairs both in flight and across flights.

Figure 8 .
Figure 8.The dense matching results of image-954.(Top left: the colorized point cloud, Bottom left: the height map of the point cloud, Top right: the accuracy map rendered by average reprojection error, Bottom right: the histogram for the reprojection error).

Figure 7 .
Figure 7. Three kinds of matching configurations for multi-view dense matching based on a graph network: 1 considers only adjacent image pairs in flight 2 considers all possible image pairs with overlap, in flight 3 considers overlapping image pairs both in flight and across flights.

1 2 3 VFigure 7 .
Figure 7. Three kinds of matching configurations for multi-view dense matching based on a graph network: ① considers only adjacent image pairs in flight ② considers all possible image pairs with overlap, in flight ③ considers overlapping image pairs both in flight and across flights.

Figure 8 .
Figure 8.The dense matching results of image-954.(Top left: the colorized point cloud, Bottom left: the height map of the point cloud, Top right: the accuracy map rendered by average reprojection error, Bottom right: the histogram for the reprojection error).

Figure 8 .
Figure 8.The dense matching results of image-954.(Top left: the colorized point cloud, Bottom left: the height map of the point cloud, Top right: the accuracy map rendered by average reprojection error, Bottom right: the histogram for the reprojection error).

Figure 9 .
Figure 9.The fused result of multi-view point clouds are presented (Top left: the colorized point cloud, Bottom left: the height map of point cloud, Top right: the accuracy map of average projection error, Bottom right: the histogram of reprojection error).

Figure 9 .
Figure 9.The fused result of multi-view point clouds are presented (Top left: the colorized point cloud, Bottom left: the height map of point cloud, Top right: the accuracy map of average projection error, Bottom right: the histogram of reprojection error).

Figure 9 .
Figure 9.The fused result of multi-view point clouds are presented (Top left: the colorized point cloud, Bottom left: the height map of point cloud, Top right: the accuracy map of average projection error, Bottom right: the histogram of reprojection error).

Figure 10 .
Figure 10.Multi-view dense matching results for difficult areas with sharp discontinuities, weak textures, or repeated textures.

Figure 11 .
Figure 11.The dense matching results for the target area in the Vaihingen dataset: (a) The color image of the target area; (b) The height map of the reference ALS point cloud; (c) The height map of the DIM result from the proposed method; and (d-f) represent the distance deviation maps for the DIM results from the three tested configurations.

Figure 10 .
Figure 10.Multi-view dense matching results for difficult areas with sharp discontinuities, weak textures, or repeated textures.

Figure 10 .
Figure 10.Multi-view dense matching results for difficult areas with sharp discontinuities, weak textures, or repeated textures.

Figure 11 .
Figure 11.The dense matching results for the target area in the Vaihingen dataset: (a) The color image of the target area; (b) The height map of the reference ALS point cloud; (c) The height map of the DIM result from the proposed method; and (d-f) represent the distance deviation maps for the DIM results from the three tested configurations.

Figure 11 .
Figure 11.The dense matching results for the target area in the Vaihingen dataset: (a) The color image of the target area; (b) The height map of the reference ALS point cloud; (c) The height map of the DIM result from the proposed method; and (d-f) represent the distance deviation maps for the DIM results from the three tested configurations.

Figure 12 .
Figure 12.Multi-view dense matching results from the proposed method and SURE (From left to right: Origin images, matching results from the proposed method, and matching results from SURE).

Figure 12 .
Figure 12.Multi-view dense matching results from the proposed method and SURE (From left to right: Origin images, matching results from the proposed method, and matching results from SURE).

Figure 13 .
Figure 13.The deviation maps for the three target areas from the proposed method, PhotoScan, and SURE (Column 1: the original color images, Column 2: deviation maps of PhotoScan method, Column 3: deviation maps of SURE method, and Column 4: deviation maps of the proposed method).

Figure 13 .
Figure 13.The deviation maps for the three target areas from the proposed method, PhotoScan, and SURE (Column 1: the original color images, Column 2: deviation maps of PhotoScan method, Column 3: deviation maps of SURE method, and Column 4: deviation maps of the proposed method).

Figure 14 .
Figure 14.The statistics of correctness rate and completeness rate in target areas.Figure 14.The statistics of correctness rate and completeness rate in target areas.

Figure 14 .
Figure 14.The statistics of correctness rate and completeness rate in target areas.Figure 14.The statistics of correctness rate and completeness rate in target areas.

Table 1 .
Data description of test image datasets.

Table 1 .
Data description of test image datasets.

Table 2 .
Comparative results for all three configurations.