Building Façade Recognition Using Oblique Aerial Images

This study proposes a method to recognize façades from large-scale urban scenes based on multi-level image features utilizing a recently developed oblique aerial photogrammetry technique. The method involves the use of multi-level image features, a bottom-up feature extraction procedure to produce regions of interest through monoscopic analysis, and then a coarse-to-fine feature matching strategy to characterise and match the regions in a stereoscopic model. Feature extraction from typical urban Manhattan scenes is based on line segments. Windows are re-organised based on the spatial constraints of line segments and the homogeneous structure of the spectrum. Façades as regions of interest are successfully constructed with a remarkable single edge and evidence from windows to get rid of occlusion. Feature matching is hierarchically performed beginning from distinctive facades and regularly distributed windows to the sub-pixel point primitives. The proposed strategy can effectively solve ambiguity and multi-solution problems in the complex urban scene matching process, particularly repetitive and poor-texture façades in oblique view.


Introduction
Information about building façades is key to building modeling, landmark recognition, navigation, scene understanding and other applications related to outdoor urban environments. Façades are usually extracted from ground data acquired by on-board cameras [1] or terrestrial laser scanners [2]. Airborne platforms have the advantages of low cost, high efficiency, wide coverage and extensive applicability. The oblique aerial images captured by recently developed airborne oblique photogrammetry [3,4] simultaneously acquire both rooftop and façade information and thus provide numerous advantages for façade-related applications in remote sensing compared with traditional systems in the vertical imaging view.
Since the development of airborne oblique photogrammetry, the visualization and texture mapping of façade information based on aerial oblique images [5][6][7] has been applied extensively, but recognition-based applications have rarely been explored. On one hand, façades can be segmented and confirmed in a 2D image space. Lin and Nevatia [8] detected façades utilizing line segment detection and perceptual organization based on a single oblique image. Perceptual organization, which is a common approach in the detection of rectangular building roofs, appeared to be suitable but was actually unfeasible: the oblique view causes considerable occlusion, and the inconspicuous spectral differences between adjacent façades hamper feature detection, which invalidates perceptual organization. The edge between the façade and the ground surface is usually occluded, and the vertical edges between adjacent façades are hard to detect owing to spectral similarity. On the other hand, more researchers have attempted to consider façades in 3D stereo vision space. Xiao et al. [9] extracted buildings and reconstructed a simple building model based on recognizing vertical façade planes. The 2D characteristics of remarkable linear horizontal and vertical structures and the 3D characteristics of significant changes in height in a pair of input images were used to implement the façade reconstruction. The façade was hypothetically generated from a monocular image and verified with a stereo image pair. Nyaruhuma et al. [10] verified building outlines in 2D cadastral datasets based on determining the spatial location of the façade. Five factors, including line match, line direction, correlation coefficient, SIFT match and building edge ratios, were employed to confirm the locations of the vertical façades in 3D space based on the 3D point cloud data from stereo matching. Meixner and Leberl [11] detected façades mapped in 3D object space based on a 3D point cloud using vertical aerial photography (greater than 20°) instead of utilizing various auxiliary data. Zebedin et al. [12] adopted an image optimization-based method to ascertain the positions of façades in a Digital Surface Model, which was employed to initialize the hypotheses and describe the 3D information of the façades.
Continuous imaging with large overlaps now makes stereo-matching-based methods feasible for façade information extraction. Matching techniques can be divided into two categories: area-based matching and feature-based matching. Area-based matching is a pixel-wise dense matching technique in which the centre of a small window is matched by statistically comparing windows of the same size in the reference image and the target image. Feature-based matching measures the similarity of a distinctive feature based on an invariance principle; that is, the extraction, description and measurement of the feature determine the matching results. Feature matching is divided into multiple levels based on the information content of the "interesting" parts of images: local low-level features (points and lines), regional mid-level features, such as regularly shaped structures, and regional high-level features that describe a specific object.
The oblique aerial photography technique produces a new data resource for façade-related issues, but adaptive matching algorithms must be developed to address the specific characteristics of large-scale urban oblique aerial imagery, such as depth discontinuities, occlusions, shadows, low texture and repetitive patterns. Terrain discontinuities invalidate the surface constraints. Required information is lost in areas with occlusions, and shadows can cause spectral confusion with normal regions. Repeated structures may produce multiple correlation peaks, which result in a high probability of mismatching. Areas with poor texture are vulnerable to mismatching or non-matching. These problems inevitably result in uncertainty and ambiguity and thus render area-based and local feature matching techniques unreliable in oblique images that cover large areas.
Fortunately, the remarkable linear structure of a building façade makes both the corner points and straight lines stand out. A building façade that is based on a Cartesian coordinate system and whose prominent structures are orthogonal to one another [13] allows for the easy application of geometric constraints on the line primitives; that is, lines should be either parallel or orthogonal to one another. If the classical line features that are obtained from windows or façade edges can be sorted and re-grouped, the processing of individual line segments becomes the processing of groups of line segments; thus, more geometric information is available for disambiguation in the recognition and matching process. Meanwhile, several studies have used a compromise that involves close cooperation between monoscopic and stereoscopic analyses to first produce areas of interest in each image and then characterize and reconstruct these areas in stereo images. Noronha and Nevatia [14] constructed 3D models of rectilinear buildings from multiple aerial images. Hypotheses for rectangular roof components were generated by hierarchically grouping lines in the images and then matching them in successive stages.
In summary, our aim is to recognize building façades and obtain their spatial information over wide areas, which is one of the biggest challenges in using oblique aerial photogrammetry. We first generate façade (or other objects of interest) hypotheses and then verify them over wide areas. We mine image feature information ranging from simple to complex structures. The façade and its microstructures are effectively detected and matched step by step, and a novel three-layer approach is used for façade reconstruction. Simple structures are specific local features in the image itself, such as points or edges that are directly detected by a corner- or linear-feature-detecting algorithm. Complex structures include mid- or high-level regional features, such as windows and façades that are constructed from low-level features.
The remainder of the paper is structured as follows. The proposed feature extraction and matching methodology is introduced in Section 2. Experimental performance evaluations are presented in Section 3, and Section 4 provides the conclusions of the study.

Methodology
The methodology is presented schematically in Figure 1. Experimental data are described in Section 2.1. The proposed approach consists of a multi-level feature extraction procedure (from low-level local features to high-level regional features), which is introduced in Section 2.2, and a coarse-to-fine hierarchical feature matching procedure (backwards from regional features to local features), which is presented in Section 2.3.

[Figure 1. Schematic of the proposed methodology. Labels recoverable from the figure: spatial constraints, window feature, façade outline, regional feature description, corners.]

The feature extraction was performed in three parts. First, straight-line segments were extracted as low-level features from oblique images using a fast and accurate line segment detector. Second, the discrete line segments were grouped as parallelogram window regional features (mid-level features) using spatial constraint analysis with fuzzy production rules and a "take the best" strategy. Third, the high-level features (façade regional features) were created by combining the existing window regional features, including both true candidates and false detections, with a façade outline that had to be re-organized from a set of segments because image noise, occlusions and deficiencies in the line extraction algorithm fragment it. Assuming that the façade is rectangular or is composed of several rectangles, the most robust edge of the façade, the border between the rooftop and the façade, was reorganized by a line linking process. The other three vertical corner lines were ignored. Thus, the spatial cluster succeeded in grouping the windows and extended the edge to the integrated outline. X-corners, and especially the angular points of windows, are widely distributed in façade scenes and serve as high-precision correspondences in the final matching process to fit the 3D façade information.
The line segments were constructed into the window regional features and the façade regional features, which decreased the location accuracy but improved the distinguishability, so the backwards coarse-to-fine strategy is feasible and applicable.
The feature matching was based on three phases in which the accuracy increases with each procedure. In the initial matching phase, we employed the regional feature description to evaluate the similarity of the façade regional features. The façades with high similarity were defined as seeds and set as reference objects. The relative location information, which was parameterized by the angle and distance eigenvectors between the non-matched and reference façades, was used to "densify" the correspondences via graph propagation. In the next matching phase, an adjacency matrix of the window network graph was weighted to faithfully express the distribution of windows in the façade, and an iterative improvement was applied to acquire a rough matching result. The matched window regional feature that was formed by the original line segments has pixel-level accuracy compared to the former approximate regions. A third matching phase was implemented to improve the precision of the geometric measurements so that sub-pixel point features could be detected and matched. The initial match reduces the search range to the façade patch, and the next match obtains a set of correspondences to estimate the transform model between each façade pair. A spatial distance measure was defined to express the degree of matching between the sub-pixel points in the reference image and the points transformed from them in the search image, and the minimum distance rule with a threshold limitation was adopted to determine whether a correspondence exists. Moreover, several improved strategies, such as symmetric processing and Random Sample Consensus (RANSAC) elimination, were applied.

Materials
Oblique images were captured by the SWDC-5 oblique aerial photogrammetry system, which is composed of five Hasselblad H3D cameras [15] (Figure 2); one of them is vertically oriented and the others are tilted at an angle of 45°. A navigation and positioning system (composed of an IMU and GPS) is installed with the cameras, so the tilt angle and the exterior orientation of the images can be obtained at the time of exposure. The ground resolution of the images is approximately 10 cm at a flying height of approximately 850 m. The proposed method was applied to an area in Yangjiang, China, using the successive oblique images from the left-oriented camera (Figure 3). Omitting some façades that are almost invisible because of occlusion or the view angle, 127 façades from 108 buildings should be detected and matched by the proposed approach. Both qualitative and quantitative evaluation criteria are provided based on the manually counted number of façades and the assumption that the flat façades are vertical.

Feature Detection
An important issue in feature-based image matching is how to extract features that effectively describe the original image. This section presents a multi-level feature extraction approach. Line segments, which are the lowest-level feature, were extracted directly first. At the next level, the window regional feature was constructed using more information about the interrelationships. At the highest level, the façade regional feature was generated as the object of interest for the reconstruction.

Line Segment Extraction
The EDLines algorithm, which was proposed by Akinlar and Topal [16], was applied to detect the straight line segments. The algorithm is suitable for automatic high-precision matching because this linear-time line segment detector requires no parameter tuning, produces continuous, clean and accurate results, and controls false linear features well. Moreover, the algorithm is more robust than the typical Hough transformation and LSD [17] against the scale and view variations of an oblique image, in which the scale varies across the image and considerable perspective distortion is present.

Window Regional Feature Grouping
A window is usually characterised by a quadrilateral shape, which indicates certain spatial constraints for the line segments. As shown in Figure 4, due to the short lengths of the window edges, an edge falls into only two cases: it is either detected as a whole or absent. In other words, the case in which an edge is divided into a series of fragments can be neglected. A dedicated solution may therefore directly reorganize several straight line segments into windows. Fuzzy production rules were used to construct candidate sets, and a "take the best" strategy was used to determine the optimal set to build the discrete line segments into a quadrangle that is representative of a mid-level feature structure in complex scenes using spatial constraints, such as direction, distance and topological relations.
For each line segment, the adjacent and parallel edges were searched simultaneously and then integrated. Prior to this procedure, several pre-processes were used to reduce the number of segments. The search space was restricted to segments that belong to the potential blocks with the lowest grey values, which were segmented by the K-Means algorithm (three categories in this paper) and expanded by morphological dilation; this restriction is intended to increase the efficiency and eliminate pseudo segments. The detailed implementation of the grouping is discussed in the following paragraphs, and the processing flow is shown in Figure 5.
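The pre-processing step can be illustrated with a minimal sketch. It assumes a greyscale image stored as a list of rows, a plain one-dimensional K-Means on the grey values, a square structuring element for the dilation, and a midpoint test to decide whether a segment belongs to the dark blocks; the function names (`kmeans_1d`, `darkest_mask`, `dilate`, `segments_in_mask`) and these choices are hypothetical and are not taken from the paper.

```python
def kmeans_1d(values, k=3, iters=50):
    """Lloyd iterations on scalar grey values; returns the sorted centres."""
    lo, hi = min(values), max(values)
    # Assumption: initial centres are spread evenly over the grey range.
    centres = [lo + (hi - lo) * (i + 0.5) / k for i in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda c: abs(v - centres[c]))
            groups[nearest].append(v)
        updated = [sum(g) / len(g) if g else centres[i]
                   for i, g in enumerate(groups)]
        if updated == centres:
            break
        centres = updated
    return sorted(centres)


def darkest_mask(image, k=3):
    """True where a pixel falls in the class with the lowest grey value."""
    centres = kmeans_1d([v for row in image for v in row], k)

    def label(v):
        return min(range(k), key=lambda c: abs(v - centres[c]))

    return [[label(v) == 0 for v in row] for row in image]


def dilate(mask, radius=1):
    """Binary dilation with a (2 * radius + 1) square structuring element."""
    h, w = len(mask), len(mask[0])
    out = [[False] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if mask[y][x]:
                for yy in range(max(0, y - radius), min(h, y + radius + 1)):
                    for xx in range(max(0, x - radius), min(w, x + radius + 1)):
                        out[yy][xx] = True
    return out


def segments_in_mask(segments, mask):
    """Keep segments (x1, y1, x2, y2) whose midpoint lies on a True pixel."""
    kept = []
    for (x1, y1, x2, y2) in segments:
        mx, my = int(round((x1 + x2) / 2)), int(round((y1 + y2) / 2))
        if 0 <= my < len(mask) and 0 <= mx < len(mask[0]) and mask[my][mx]:
            kept.append((x1, y1, x2, y2))
    return kept
```

A call such as `segments_in_mask(segments, dilate(darkest_mask(image)))` would then pass only the segments near dark blocks to the grouping stage.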

Searching for Candidate Adjacent Segments
For each reference line segment li, we may obtain several candidate segments (Figure 6), which are defined as set Ai and are selected by the distance (δd) and direction (δa) constraints of Equation (1). Let α denote the angle between adjacent edges of the windows. In Equation (1), the distance term is the minimum distance between the endpoints of li and lj, the direction term is the angle between the two lines, and α is a variable that is estimated based on the position on the image plane. The candidate segments in Ai were screened to a set Ci consisting of candidate subsets via the intersection judgment and direction (δa) constraints of Equation (2) to constitute a U-shaped feature; a subset that contains only one element is directly excluded.
In Equation (2), the intersection judgment guarantees that the paired segments lie on the same side of the reference segment, and the direction constraint eliminates the pairs with low accuracy. Figure 7 shows the cases excluded by the rules.
After several subsets were selected, the best one was determined through the "winner takes all" scheme. For each pair {lN1, lN2} belonging to Ci, we can evaluate the score with Equation (3) and select the structure with the lowest score (smaller than the threshold) as the optimal one for constructing the ultimate U-shaped structure. In Equation (3), lN1↔N2 denotes the imaginary parallel edge produced by connecting the endpoints of the adjacent edges lN1 and lN2, and length(·) represents the length of a segment.

Figure 7. Producing the candidate subsets. Segment II belongs to set AI, and the procedure judges whether a relevant subset that includes II exists. Segment IV is disregarded because it is close to the same endpoint. Segment V is excluded because of its overlarge angular deviation, and Segment VI is rejected by the intersection judgment. Segment III remains for weight calculation according to the imaginary edge lII↔III.
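The screening of candidate adjacent segments can be sketched as follows. Because the exact form of Equation (1) is not reproduced in the text, the sketch assumes the simplest reading of the description: a segment is a candidate if its minimum endpoint distance to the reference segment is at most δd and the angle between the two lines deviates from α by at most δa. The function names and the default α = 90° are assumptions made for illustration only.

```python
import math


def endpoint_distance(a, b):
    """Minimum distance between any endpoint of a and any endpoint of b."""
    return min(math.dist(p, q) for p in a for q in b)


def line_angle(a, b):
    """Angle in degrees (0 to 90) between the lines through segments a and b."""
    ta = math.atan2(a[1][1] - a[0][1], a[1][0] - a[0][0])
    tb = math.atan2(b[1][1] - b[0][1], b[1][0] - b[0][0])
    diff = abs(math.degrees(ta - tb)) % 180.0
    return min(diff, 180.0 - diff)


def adjacent_candidates(ref, segments, alpha=90.0, delta_d=10.0, delta_a=10.0):
    """Assumed reading of Equation (1): distance and direction thresholds."""
    return [s for s in segments
            if s is not ref
            and endpoint_distance(ref, s) <= delta_d
            and abs(line_angle(ref, s) - alpha) <= delta_a]
```

Each segment is a pair of endpoints, for example `((0, 0), (20, 0))`. The loose default thresholds mirror the empirical values δd = 10 and δa = 10 reported in the parameter-setting section.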

Search for Candidate Parallel Edges
Based on actual cases, each reference line segment li should have a parallel line. Thus, parallel set Pi was generated using a distance (δ'd) and orientation (δa) constraint (Equation (4)).

Generating the Window Regional Feature
The final procedure determines whether the optimal structure exists. A probability analysis (Equation (5)) based on the pre-existing U shape was applied to each element of set Pi to select the qualified elements, which were then weighted according to Equation (6). Consequently, the segment with the highest score serves as the remaining edge and results in an integrated window feature.
However, set Pi may be empty before or after the probability analysis, in which case the window feature is constructed by directly closing the U-shaped polygon.
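The fallback case, in which no parallel edge survives and the U shape is closed directly, can be sketched as below. The sketch assumes that each adjacent edge is attached to the reference edge at the endpoint nearest to it and that the closing edge joins the two free endpoints; `close_u_shape` and its helper are hypothetical names, not the authors' implementation.

```python
import math


def _split(adj, ref):
    """Return (reference endpoint the edge attaches to, free endpoint of the edge)."""
    _, anchor, attached = min(
        ((math.dist(p, q), q, p) for p in adj for q in ref),
        key=lambda item: item[0],
    )
    free = adj[1] if attached == adj[0] else adj[0]
    return anchor, free


def close_u_shape(ref, adj1, adj2):
    """Corners of the quadrilateral obtained by closing a U-shaped structure.

    The last two corners bound the imaginary edge that replaces the missing
    parallel edge.
    """
    a1, f1 = _split(adj1, ref)
    a2, f2 = _split(adj2, ref)
    if a1 == a2:
        # Both edges meet the same end of the reference edge: not a U shape.
        raise ValueError("adjacent edges must attach to different endpoints")
    return [a1, a2, f2, f1]
```

For a reference edge `((0, 0), (10, 0))` with adjacent edges `((0, 1), (0, 8))` and `((10, 7), (10, 0.5))`, the sketch returns the corners `(0, 0)`, `(10, 0)`, `(10, 7)` and `(0, 8)`.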

Parameter Setting in the Processing
In the processes described above, a set of loose thresholds should be selected; otherwise, the probability of missing correct candidates will increase. In our implementation, the thresholds were empirically set to δd = 10, δ'd = 100 and δa = 10. The quantity of candidate segments is more important than their quality because the "take the best" strategy reliably selects the optimal segments; a large set of feasible candidates is therefore helpful.

Selection and Linear Grouping of Façade Edges
The façade edge is usually divided into several broken line segments (Figure 4). We adopted the four organization criteria proposed by Izadi and Saeedi [18] and combined them with local spectral similarity around the line segments to restore the entire edge. We selected the wall-roof edge from the set of linked lines based on the window regional features using the following two steps.

Judging Edge Directions through Cluster Analysis
A certain number of directions (θ1, θ2, θ3, …, θn, where |θ1 − 90| = min|θ − 90|), which represent the vertical or horizontal directions in the object space, can be obtained by a directional cluster analysis of the windows (Figure 8). Considering the geometric properties of perspective projection imaging, the vertical wall edges are mapped into near-vertical directions in oblique images. The near-vertical direction, which includes most of the samples (defined as θ1, generally |θ1 − 90| < 10), corresponds to vertical edges that are ignored by the process; the others correspond to horizontal edges that are retained to extract the wall-roof edges. Clustering is expected to reduce the search space and limit false extractions.
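A minimal sketch of the direction judgment is given below. It assumes edge directions expressed in degrees in [0, 180), a simple gap-based one-dimensional clustering (the paper does not state which clustering algorithm was used) and a tolerance of 10° around θ1; wrap-around between 0° and 180° is ignored for brevity. All names are hypothetical.

```python
def cluster_directions(angles, gap=5.0):
    """Group sorted angles whose neighbours differ by at most `gap` degrees.

    Returns the mean direction of each cluster.
    """
    clusters = []
    for a in sorted(angles):
        if clusters and a - clusters[-1][-1] <= gap:
            clusters[-1].append(a)
        else:
            clusters.append([a])
    return [sum(c) / len(c) for c in clusters]


def split_by_vertical(angles, gap=5.0, tol=10.0):
    """Return theta1 (cluster closest to 90 degrees) and the retained angles.

    Angles within `tol` of theta1 are treated as vertical edges and dropped;
    the remaining ones are candidates for wall-roof (horizontal) edges.
    """
    centres = cluster_directions(angles, gap)
    theta1 = min(centres, key=lambda t: abs(t - 90.0))
    horizontal = [a for a in angles if abs(a - theta1) > tol]
    return theta1, horizontal
```

With the directions `[88, 92, 90, 10, 12, 170]`, the sketch yields θ1 = 90 and retains `[10, 12, 170]` for the wall-roof edge extraction.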

Façade Linear Feature Organization
The extraction problem is now reduced to the remaining horizontal segments. We assume that the longer the line is, the higher the probability that it is a façade edge. Based on the image geometry, the windows that belong to a certain façade in the object space remain below the wall-roof edge in the image space. The extraction traverses the remaining lines and applies the following rules.
i. If no window regional feature is located between two straight lines, the shorter one is eliminated.
ii. If no window regional feature is located below the straight line within a certain range, the line is eliminated.
iii. If many windows are located above the straight line within a certain range, the line is eliminated.
After this process, most of the wall-roof edges can be distinguished properly, yet the fragmentation problem still exists because of large gaps that were neglected in the linear grouping. The following rules are intended to identify collinear straight lines.
i. If two lines partially overlap, the lines are replaced by a new line that extends the longer line to fully cover the shorter one (Figure 9 (left)).
ii. If two lines are nearly collinear but are far apart from each other and there is another line in the neighbourhood that overlaps both lines, the lines are replaced by a new line that connects the farthest endpoints (Figure 9 (right)).
At this point, the edges between the roof and the wall have been restored, although some may be mixed together. Fortunately, this effect is negligible.
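The first collinearity rule can be sketched as a projection onto the longer line. The sketch assumes that the two lines have already been judged nearly collinear and partially overlapping, and it extends the longer line along its own direction until it covers the projections of the shorter line's endpoints; `merge_overlapping` is a hypothetical name.

```python
import math


def _length(seg):
    return math.dist(seg[0], seg[1])


def merge_overlapping(a, b):
    """Extend the longer of two nearly collinear lines to cover the shorter."""
    longer, shorter = (a, b) if _length(a) >= _length(b) else (b, a)
    (x0, y0), (x1, y1) = longer
    full = _length(longer)
    ux, uy = (x1 - x0) / full, (y1 - y0) / full   # unit direction of the longer line
    # Positions along the longer line: its own endpoints plus the projected
    # endpoints of the shorter line.
    ts = [0.0, full] + [(px - x0) * ux + (py - y0) * uy for px, py in shorter]
    t_min, t_max = min(ts), max(ts)
    return ((x0 + t_min * ux, y0 + t_min * uy),
            (x0 + t_max * ux, y0 + t_max * uy))
```

For the lines `((0, 0), (10, 0))` and `((8, 0.2), (15, 0.1))`, the merged line runs from `(0, 0)` to `(15, 0)`; the small perpendicular offsets of the shorter line are discarded by the projection.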

Façade Regional Feature Construction Based on Plane Sweeping Methods
The façade regional feature, which is a hypothetical façade, was constructed by incorporating both the wall-roof edges and the window regional features. Each wall-roof edge was swept downward in the clustering direction θ1 until it encountered another straight line or the border of the image. This procedure was used to create a prototype of a corresponding hypothetical façade that contains all of the windows that had been passed by. For each prototype, the vertical edge direction θ1 was revised (to θ) by clustering based on the window edges that belong to the façade. The windows with large deviations were eliminated. Ultimately, the façade feature was generated by the wall-roof edges, the direction θ and the farthest window.

Feature Matching and Reconstruction
The regions of interest for the façade were produced by monoscopic analyses. Next, we characterised and reconstructed the areas. The hierarchical structure, comprising high-level façades of interest, mid-level line sets of the windows and low-level primitives of line segments or points, provided several options for feature-based matching. The distinctive façades of interest contain considerable information, but the precision is too low for reconstruction in the object space. The windows, or the groups of interconnected line segments, are distributed in a regular pattern. However, they exhibit strong mutual similarity, repetition and low texture. Hence, distinguishing the correspondences is difficult. Sub-pixel point primitives are desirable and indispensable in man-made scenes but lack a distinctive appearance that can be distinguished by a conventional strategy in cases of repetitive patterns and low-texture environments.
A new hierarchical coarse-to-fine feature-based matching approach is thus proposed to promote the comprehensive utilisation of multi-level image features. The scheme begins with coarse matching of the façade regional features to reduce the search space into separate sub-images. Coarse matching of the window regional features is then implemented to establish transform matrices between the façade pairs. Finally, fine matching of the sparse point features is implemented to restore the 3D spatial information of the façade.

Coarse Matching Based on the Façade Features
Matching the façade features is the basis of our approach to accurate and robust reconstruction of the façade's spatial location in the object space. A region description method was adopted to minimize the influence of inaccurate façade edges; it consists of Euler's number (EN), the ratio of the area of the windows to the area of the façade (AR) and the grey histogram (H). A seed propagation solution was used in which the matched pairs extend from high similarity to low similarity along the topological graph generated by the coordinates of the façades' centroids (c(x, y)). The feature similarities of EN and AR were calculated by |1 − EN1/EN2| (= E) and |1 − AR1/AR2| (= R), respectively, and that of the histogram was measured by the sum of the chi-square (dchi-square) and intersection (dintersection) distances [19]. The smaller the value is, the higher the similarity is. The highly similar façades were defined as seeds, and the others were retained to determine whether they match.
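The region description can be sketched as follows. The terms E and R follow the formulas in the text. The histogram term is an assumption: the sketch uses normalized histograms, the symmetric chi-square form Σ(p − q)²/(p + q), and 1 − Σ min(p, q) as the intersection distance so that smaller values mean higher similarity; the exact definitions in [19] may differ. The dictionary keys and function names are hypothetical.

```python
def ratio_term(a, b):
    """|1 - a / b|, as used for E (Euler's number) and R (area ratio)."""
    return abs(1.0 - a / b)


def chi_square(h1, h2):
    """Symmetric chi-square distance between two normalized histograms."""
    return sum((p - q) ** 2 / (p + q) for p, q in zip(h1, h2) if p + q > 0)


def intersection_distance(h1, h2):
    """One minus the histogram intersection of two normalized histograms."""
    return 1.0 - sum(min(p, q) for p, q in zip(h1, h2))


def facade_dissimilarity(f1, f2):
    """Return (E + R, histogram distance); smaller values mean more similar.

    Each facade is a dict with keys "EN", "AR" and "H". The Euler's number of
    the second facade must be non-zero for the ratio to be defined.
    """
    e = ratio_term(f1["EN"], f2["EN"])
    r = ratio_term(f1["AR"], f2["AR"])
    h = chi_square(f1["H"], f2["H"]) + intersection_distance(f1["H"], f2["H"])
    return e + r, h
```

Two identical descriptions give `(0.0, 0.0)`, which corresponds to the highest possible similarity under this convention.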

Generating and Matching Seeds
The seed façades, which are relatively sparse correspondences defined by high similarity, are the bases of the subsequent propagation. Thus, the accuracy of the pre-matching should significantly outweigh the quantity to avoid initial errors; that is, only the most reliable façades should be utilized rather than trying to match the maximum number. For one façade $f^{l}$ in the reference image (l), the similarity (E + R) with each façade $f^{r}$ of the target image (r) was calculated.
The three façades with the highest similarity were set as matching candidates pi = {$f^{r}_{1}$, $f^{r}_{2}$, $f^{r}_{3}$}. An element is a matched feature and is defined as a seed (d) only if two conditions are satisfied: the similarity between the element and the reference façade should be greater than the threshold α, and the histogram similarity between the element and the reference façade $f^{l}$ should be the highest among the three candidates and greater than the threshold β. N matching pairs $\{\vec{d}_{n}\}_{n=1}^{N}$ exist after the traversal, and the possibility of a mismatch cannot be excluded. As a classic means to resolve the problem, the RANSAC algorithm was employed to remove the outliers by fitting the underlying fundamental matrix.

Matching Entire Façades
The seed façades were used as base stations to describe the other façade features by mutual geometric relationships: an eigenvector formed from the angles and distances to the seeds replaced the previous region descriptions, which were limited by inaccurate locations of interest such that the region-based similarity is actually indistinguishable and invalid. Perspective rectification based on the seeds was performed to reduce the distortion from geometric differences between views.
The "densifying" of the correspondences across the entire area is illustrated below. The newly constructed eigenvector measures the similarity along the topological neighbourhood graph that was constructed by the Delaunay triangulation algorithm according to the centroid coordinates. The propagation, instead of a full traversal, is expected to utilize the relative relationship constraints to ensure the matching accuracy.

Coarse Matching Based on the Window Features
The primitive match generates the corresponding segmented façades; thus, the subsequent match can focus on the parts of interest as sub-images that are independent of the background, although the matched façades cannot provide precise geographic position information. Point primitives, particularly the rich corner features, seem to be effective and feasible. However, either the repeated structure of the windows or the imprecise outlier elimination based on the fundamental matrix causes problems in the point matching because the appearance is not distinctive, and there is no constraint to eliminate the ambiguity. In contrast, the groups of line segments (the aforementioned window regional features) are naturally suited to matching because more regional information is available for disambiguation. Although the spectral and geometric characteristics may be similar and indistinguishable and uniqueness becomes a problem, the spatial position and distribution characteristics are well determined. Therefore, the window regional features were quantified by their spatial positions relative to the façades and bridged to build a topological neighbourhood graph.

Construction of a Window Topological Graph
The a priori knowledge of the grid structure indicates that the windows are generally distributed transversely and longitudinally. Therefore, the neighbourhood graph was generated by connecting the centroids in the two directions that were obtained by the directional clustering. The graph is represented by the logical matrix $M_{m,n}(\zeta)$, which was inspired by the regular structure, where a binary variable $X_{i,j}(\zeta)$ indicates whether window ζ is present on the node in row i (1, 2, …, m) and column j (1, 2, …, n).

Fuzzy Weighting
Two weights were introduced to indicate the spatial position of the window in the façade while avoiding mismatching that is characterized by multi-valued mapping and ambiguity in binary patterns, especially when too few windows are present and they are scattered in space. In other words, the fuzzy adjacency matrix $\mu_{m,n}(\zeta)$ is expected to enhance the uniqueness and validity of the matching. For window ζ, we let $\omega^{\zeta}_{r}$ and $\omega^{\zeta}_{c}$ indicate the longitudinal and horizontal positions that correspond to the row and column in the graph, respectively. The distances from the centroid of the window ζ (c(ζ)) to the façade edges, $\delta^{\zeta}_{t}$, $\delta^{\zeta}_{l}$ and $\delta^{\zeta}_{r}$, correspond to the top, left and right edges, respectively. The average length ε of the vertical edges of the windows was used to normalize the parameters to weaken the perspective projection effect. Hence, the weights were evaluated as $\omega^{\zeta}_{r} = \delta^{\zeta}_{t}/\varepsilon$ and $\omega^{\zeta}_{c} = \delta^{\zeta}_{l}/\delta^{\zeta}_{r}$. However, $\omega^{\zeta}_{c}$ will be miscalculated and result in notable mismatches if the endpoints of the vertical edges exhibit significant dislocation. Thus, to estimate the degree of dislocation, the difference μ between the matched façades was calculated by μ = |ratio1(length) − ratio2(length)|, where ratio1(length) and ratio2(length) are the ratios of the length of the matched wall-roof edge to that of the seed façades. When μ > δ (δ = 0.01 in this study), the matched façade is replaced by the closest eligible one to change the weight $\omega^{\zeta}_{c}$ to $\omega'^{\zeta}_{c}$.
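The two weights can be computed with a few lines of geometry. The sketch below assumes that each façade edge is given by two image points and that the distances δt, δl and δr are perpendicular point-to-line distances; the function names are hypothetical.

```python
import math


def point_line_distance(p, a, b):
    """Perpendicular distance from point p to the line through a and b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    cross = (bx - ax) * (ay - py) - (ax - px) * (by - ay)
    return abs(cross) / math.hypot(bx - ax, by - ay)


def window_weights(centroid, top, left, right, eps):
    """Row weight delta_t / eps and column weight delta_l / delta_r.

    `top`, `left` and `right` are facade edges given as endpoint pairs, and
    `eps` is the average length of the vertical window edges.
    """
    d_t = point_line_distance(centroid, *top)
    d_l = point_line_distance(centroid, *left)
    d_r = point_line_distance(centroid, *right)
    return d_t / eps, d_l / d_r
```

For a window centred at `(25, 30)` in a façade with top edge `((0, 0), (100, 0))`, left edge `((0, 0), (0, 60))`, right edge `((100, 0), (100, 60))` and ε = 10, the row weight is 3 and the column weight is 1/3.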

Iterated Matching
Given the weighted matrix $\mu_{m,n}(\zeta)$, the next objective is to determine the correspondences. The rows and columns were matched as separate units that are equivalent to the corresponding nodes between the graphs. An iterative strategy was used to eliminate mismatches and modify the initial result while avoiding the influence of viewpoint changes and position errors of the façade edges. The weights of a row and of a column are the averages of $\omega^{\zeta}_{r}$ and $\omega'^{\zeta}_{c}$ over all of the relevant elements, respectively.

Fine Matching Based on X-Corner Features
The corresponding windows provide the match between the lines and the X-corners that are naturally obtained by their intersections. The intersections are lower in accuracy and fewer in quantity than the corners obtained by direct detection based on image grey information, which calculates the curvature and gradient, as in the Förstner, Harris and SUSAN methods. However, the matched intersections can be used as approximations for the refined corner features to avoid mismatches in fine matching that are caused by repetitive patterns and low texture if the camera geometry is estimated in advance.
In this study, the Shi-Tomasi operator was used to extract well-defined X-corner feature points [20]. A calculation mechanism related to the peak value position was then used to acquire sub-pixel accuracy from the pixel-level corners to meet the demands of the measurement [21,22]. Point-by-point matching between two views is generally a time-consuming task. However, the search space may be reduced by orders of magnitude in the area of the façade of interest. The approximate transformation formula φπ between façades in binocular views was calculated from the initial corresponding intersections using the least squares fitting method. The RANSAC algorithm was used to remove outliers in the fitting procedure. The outliers originated from the incorrectly matched windows and the imprecise intersections. An approximate nearest neighbour search and a symmetric strategy were applied to allow fast and stable matching.
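The minimum distance rule with a threshold and the symmetric strategy can be sketched as a mutual nearest neighbour test. The sketch assumes that the façade-to-façade transformation has already been estimated and is passed in as a callable, and it uses a brute-force search in place of the approximate nearest neighbour search mentioned above; `symmetric_match` is a hypothetical name.

```python
import math


def symmetric_match(ref_pts, search_pts, transform, max_dist):
    """Match corners by mutual nearest neighbour after transforming ref_pts.

    Returns index pairs (i, j) such that search_pts[j] is the closest search
    corner to transform(ref_pts[i]), the reverse lookup agrees, and the
    distance does not exceed max_dist.
    """
    projected = [transform(p) for p in ref_pts]

    def nearest(q, pts):
        return min(range(len(pts)), key=lambda k: math.dist(q, pts[k]))

    matches = []
    for i, q in enumerate(projected):
        j = nearest(q, search_pts)
        mutual = nearest(search_pts[j], projected) == i
        if mutual and math.dist(q, search_pts[j]) <= max_dist:
            matches.append((i, j))
    return matches
```

A reference corner whose transformed position is closest to a search corner that prefers another reference corner is left unmatched, which is what suppresses one-to-many assignments on repetitive window grids.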

Spatial Localization of the Façade
POS data combined with ground control points were employed to generate sparse 3D points by space intersection, which was shown to be reasonable in the comparative studies of Sukup et al. [23]. Once the 3D-structured corners are recovered in the object space, the façades, which are assumed to be planes, can easily be reconstructed by interpolation, and their locations in the object space can be determined. Two direction vectors (the normal vector n and tangent vector t) can be calculated by 3D linear fitting to quantitatively describe the measurements, which also evaluates the precision of the matching.
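A minimal sketch of one way to obtain n and t from reconstructed 3D corners is given below. It assumes a total least squares plane fit, in which the normal is the eigenvector of the scatter matrix with the smallest eigenvalue, and a z-up object coordinate system; the study's own fitting procedure is not specified in this detail. The function names and the power-iteration solver are hypothetical choices that keep the example dependency-free.

```python
import math


def fit_plane_normal(points, iterations=200):
    """Estimate the unit normal of the best-fit plane through 3D points."""
    n = len(points)
    if n < 3:
        raise ValueError("at least three points are required")
    centroid = [sum(p[k] for p in points) / n for k in range(3)]

    # Scatter matrix of the centred points.
    S = [[0.0] * 3 for _ in range(3)]
    for p in points:
        d = [p[k] - centroid[k] for k in range(3)]
        for r in range(3):
            for c in range(3):
                S[r][c] += d[r] * d[c]

    # The dominant eigenvector of (trace * I - S) is the eigenvector of S
    # with the smallest eigenvalue, i.e. the plane normal.
    trace = S[0][0] + S[1][1] + S[2][2]
    M = [[(trace if r == c else 0.0) - S[r][c] for c in range(3)]
         for r in range(3)]

    # Power iteration; the start vector is arbitrary and must not be
    # exactly orthogonal to the true normal.
    v = [0.531, 0.697, 0.482]
    for _ in range(iterations):
        w = [sum(M[r][c] * v[c] for c in range(3)) for r in range(3)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            raise ValueError("degenerate point configuration")
        v = [x / norm for x in w]
    return tuple(v)


def horizontal_tangent(normal):
    """Horizontal in-plane direction, t = z x n (assumes z is up)."""
    tx, ty = -normal[1], normal[0]
    norm = math.hypot(tx, ty)
    if norm == 0.0:
        raise ValueError("normal is vertical; the plane is horizontal")
    return (tx / norm, ty / norm, 0.0)


def tilt_from_vertical_deg(normal):
    """Deviation of the plane from vertical; 0 for an ideal vertical facade."""
    return math.degrees(math.asin(min(1.0, abs(normal[2]))))
```

The tilt angle gives a single number for comparing a fitted façade against the vertical-plane assumption, which is the kind of check used later to compare X-corners with intersections.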

Experimental Results and Evaluation
The proposed façade recognition approach was implemented in C++. The results of the 2D façade detection in the image space and the 3D reconstructed information in the object space are presented and evaluated in the following sections.

Failure of Conventional Approach
In feature matching and three-dimensional information fitting, SIFT-based matching and methods based on the epipolar-line constraint are commonly accepted. Scale Invariant Feature Transform (SIFT) features are widely used to detect and describe point features in image matching. However, window and façade areas are poorly textured, which makes it difficult to detect SIFT features robustly. As shown in Figure 10, there are few corresponding SIFT features (green crosses). The epipolar-line constraint is widely used in point matching; on the basis of the robust Shi-Tomasi corner operator, it reduces the search space from the whole image to the epipolar line. On the one hand, a certain number of initial corresponding points are necessary to build the epipolar line. On the other hand, it is difficult to identify one-to-one corresponding points because of the repetitive structure of windows (Figure 11).
Therefore, the proposed multi-level features were developed to resolve the repetitive structures and poor texture that are characteristic of façades in large-scale urban oblique aerial imagery.

Evaluation of Regional-Feature Detection
The results of detecting façades and windows from the reference image and the search image are shown in Figure 12.
(1) The quadrilaterals show the constructed window regional features; the orange boxes represent false window features that were eliminated by subsequent spatial clustering, and the green boxes show the final window features (Figure 12). The window features were extracted well. The spatial distribution is reasonable; most of the visible façades that contain windows were successfully extracted using a certain number of window features (Table 1), so it is feasible to detect façades based on windows. False features are identified by the K-Means spectral clustering analysis used to select candidate line segments; thus, K-Means spectral clustering improves façade detection. However, many windows are missing, mainly because of occlusion or small size (Figure 13). Specifically, self-occlusion and emerged occlusion make several windows difficult to image and thus result in feature absence. Most of the small windows are missing because the ED line algorithm operates with a minimum length limit (if the image patch is approximately 3000 × 3000 pixels, then the minimum length of the line segments is approximately 15 pixels). Nevertheless, quality is more important than quantity. The missing windows, which serve as transitional features that help to generate candidate façades and perform fine matching, can usually be ignored as long as a certain number of windows are present in one façade.
Note (Table 1): For the first three columns, the number before the slash denotes the quantity of façades that meet the corresponding extraction ratio, and the number after it represents the number of façades constructed. For the last column, the number before the slash denotes the quantity of façades for which no windows are detected, and the number after it represents the number of façades not constructed.
(2) Figure 12 shows examples in which the façade boundaries are overlaid on the input images. The detected façades are divided into four types: completely correct detections, partially correct detections, missing detections and incorrect detections. A feature is regarded as completely correct (Figure 14a) if the wall-roof edge and the vertical edges are properly located, even though the bottom of the façade may not be imaged because of occlusion. However, in several cases, either inaccurate endpoint locations or an incorrect distribution of windows causes incompleteness (Figure 14b). Nevertheless, the detection is considered to be partially correct because the façade is constructed based on a poorly defined area of interest. Under some circumstances, the windows or wall-roof edge may not be extracted; this results in a missing façade (Figure 14c). Moreover, false wall-roof edges and false window regional features may meet the criteria, which leads to an incorrectly detected façade.
Table 2 shows the accuracy assessment at the object level. To quantify the façade extraction results, we adopted three frequently used metrics [24,25]. Precision indicates the extent to which the detected façades are at least partially real. Recall is a measure of the omission error. Overall accuracy and F1-score are composite metrics that consider both correctness and completeness. These metrics are computed from NTP, NFP and NFN, which represent the number of true positives (i.e., completely and partially correct extracted façades), false positives (i.e., incorrectly extracted façades) and false negatives (i.e., missing façades), respectively.
The accuracy of the boundary delineation was qualitatively analysed at the pixel level. The errors mainly originate from the partially correct detections and dislocations of the wall-roof edge. The partially correct detection indicates the existence of the boundary but ignores the inaccuracy. Several line features are present along the wall-roof edge because of eaves or other linear objects (Figure 15). However, only the one with the highest weight remained, which means that different weighting schemes lead to different choices. Therefore, the "non-integrity" and non-uniqueness make it problematic to perform the reconstruction based on the façade features and to perform façade matching based on edges.
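For reference, the object-level precision, recall and F1-score are conventionally defined as below, written in the NTP, NFP and NFN notation of the text. These are the standard definitions and are given as an assumption about the form of the metrics, not as a reconstruction of the original equations; overall accuracy is omitted because its exact definition cannot be recovered from the surrounding text.

$$
\mathrm{Precision} = \frac{N_{TP}}{N_{TP} + N_{FP}}, \qquad
\mathrm{Recall} = \frac{N_{TP}}{N_{TP} + N_{FN}}, \qquad
F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}
$$

Because F1 is the harmonic mean of precision and recall, it equals their common value when the two coincide and is pulled towards the lower of the two otherwise.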

Evaluation of Feature Matching
After extracting the features (e.g., the façade and window regional features and X-corners), we established multi-level feature matches step by step. After determining which correspondences are passed on to the next stage for further processing, it is necessary to quantify the performance of the matches. A confusion matrix can be constructed to represent the dispositions of the test set, including a series of common metrics. Commonly used definitions, including counting the number of true and false matches and match failures, were adopted [26]. Then, these numbers were normalised into unit rates by following Equation (9). TP: true positives, i.e., the number of correct matches; FN: false negatives, i.e., the number of matches not found; FP: false positives, i.e., the number of incorrect matches; TN: true negatives, i.e., the number of non-matches that were correctly rejected.
(1) The façades were matched by combining the procedures of seed façade selection and propagation. For all of the detected façades, we chose several candidate seeds (25 in the study area) with region-based similarity measures and then removed the outliers (three in the study area) using the RANSAC algorithm (Figure 16). Because the seed selection seeks an optimal solution, only the outstanding seeds, which were exactly matched, served as base stations. The others were then examined to determine if correspondence existed (Figure 17). The results of the façade regional feature matching are evaluated by the confusion matrix and common performance metrics (Table 3). The high accuracy demonstrates that the process gave an effective solution to the problematic inaccuracy of the partially correct detection. Indeed, precise matching is also the basis of effective post-processing. Moreover, although the façade matching rate is 73.2%, the actual rate reaches 81.5% when façades that belong to different buildings are considered.
(2) In the matched façades, the windows were matched well utilizing the fuzzy topological neighbourhood graph and the iterative process. The reference and target images contained 828 and 702 windows, respectively, of which correspondences were found for 497 (Figure 18). The relevant confusion matrix and common performance metrics (Table 4) provide a quantitative evaluation of the window regional feature matching.
(3) Once the windows were matched, which yields a series of corresponding intersections, the approximate transformation formula φ π between the corresponding façades can be estimated separately using the RANSAC algorithm. Then, the refined X-corners in the sample area are matched (Figure 19), which not only improves the accuracy from the pixel level to the sub-pixel level but also increases the number of matching points almost three-fold. Moreover, façade regions of interest reduce the search space by 2 to 3 orders of magnitude. Approximately 10⁵ X-corners (697,617 and 623,051) are detected in each of the perspectives, whereas only 10² to 10³ are contained in the corresponding façades. Ultimately, two vectors, a normal vector n and a tangent vector t, were fit to express the façade plane, which was also used to quantitatively verify the accuracy of the proposed hierarchical matching approach.
Table 5 compares the results from different corresponding features: sub-pixel X-corners and pixel-wise intersections. The fitted parameters are evaluated against the hypothetical standard of a vertical plane, with the statistics classified by absolute difference. The fitted vectors based on the X-corners are far more precise, which justifies the reliable and necessary procedure of moving backwards from regional features to local features.
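The normalisation of the match counts into unit rates can be sketched as follows. The rate definitions used here (true positive rate, false positive rate, positive predictive value and accuracy) are the standard ones and are assumed to correspond to the rates reported in Tables 3 and 4; the function name `match_rates` and the counts in the usage example are hypothetical.

```python
import math


def match_rates(tp, fn, fp, tn):
    """Normalise confusion-matrix counts into unit rates."""

    def ratio(num, den):
        # Undefined rates (empty denominator) are reported as NaN.
        return num / den if den else math.nan

    return {
        "TPR": ratio(tp, tp + fn),                  # correct matches found
        "FPR": ratio(fp, fp + tn),                  # non-matches wrongly accepted
        "PPV": ratio(tp, tp + fp),                  # accepted matches that are correct
        "ACC": ratio(tp + tn, tp + fn + fp + tn),   # overall correct decisions
    }


# Illustrative counts only; these are not results from the study.
rates = match_rates(tp=18, fn=2, fp=4, tn=76)
```

ACC weights true negatives as heavily as true positives, so it can remain high even when some matches are missed; reading it alongside TPR and PPV gives a fuller picture of each matching stage.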

Conclusions
We addressed the problem of large-scale object recognition and spatial localization of urban façades from oblique aerial images. The approach can effectively segment façade areas from oblique aerial images and obtain precise 3D spatial position information.
Multi-level features were extracted using a bottom-up approach. The low-level line segments are well detected and are needed to construct the object features. The mid-level windows are obtained by organization and clustering algorithms and play a transitional role in both façade extraction and restoration. The high-level façades are constructed through a new plane sweeping procedure and serve as areas of interest. The test achieved a comprehensive F1-score of approximately 90%, which indicates that the proposed feature-extracting approach achieved promising results.
The newly proposed coarse-to-fine hierarchical matching approach exhibits refined sub-pixel performance based only on image information. Although most façades are characterized by an uncertain outline, we conclude that the seed propagation algorithm is fairly insensitive to this uncertainty; the matching in the experimental area is almost completely correct (ACC = 97.5%). Transitional matching was conducted effectively (ACC = 93.2%) on the window features to avoid the ambiguity and mismatching that repetitive patterns and poor textures cause when the corresponding point primitives are searched directly. Finally, the intersections succeeded in building the transformation relationship between the façade pairs, and the sub-pixel level corner primitives were then used to fit the 3D vectors of the façade information.
The experiment regards remarkable window features as a transition connection; that is, the validity lies in the existence and extraction of windows that are part of the façade. The approach fails in cases in which there is no evidence of regular windows in the façade. Additionally, we grouped the line primitives into windows but ignored the robust points because of the lack of spatial relationship constraints. In the future, we plan to comprehensively utilize both line segments and X-corners to improve the window extraction rate, which may improve the matching effect.
The experiments were conducted in binocular stereo vision. The study area can also be obtained in more than two successive images, so a multi-view strategy may compensate for the insufficient or unstable extraction of regional features and improve the matching rate. In the case of N views, the images can be defined as $C_N^2$ binocular tuples; thus, several features may be matched in a certain pair even though they may not appear in all of the images. This would improve the extraction and matching ratio. The images used in this study were captured in one direction. Better results might be obtained if the images were captured in four directions (N, S, E and W).

Figure 1. Workflow of façade recognition in the image space and localization in the object space.

Figure 3. Images of the study area. The red manually drawn boxes depict the potential façades to be processed. Image (a) is used as the reference image (the study area is shown by the green box); and image (b) is used as the search image. (a) Original reference image; (b) Original search image.

Figure 6. Finding adjacent line segments. The red line segments II and III are adjacent to line segment I because the end points are within the adjacent regions and the angles are within the ranges. Line segments IV (green) and V (orange) are either too distant or at too high an angle to include, and the purple segments are excluded because they do not meet both conditions. Thus, set A_I = {II, III} is obtained.

Figure 8. Directional clustering of a sample image. Six main directions are present. The value of approximately 80° is considered to represent the vertical edges. The value of approximately 100°, which may be closer to 90°, is considered to represent the horizontal edge because of the much smaller number of samples.

Figure 10. SIFT features detected in binocular vision images.

Figure 11. Epipolar-based Shi-Tomasi point matching. For the corner (green) in the reference/left image, the search space of the corresponding point is reduced to the epipolar line (black line in the search/right image). However, there may exist several candidate matching points because of the repetitive and poor-texture character of the façade area.

Figure 13. Failed window regional feature extraction. The windows enclosed in red boxes are examples of self-occlusion, and the windows enclosed in blue boxes are examples of emerged occlusion; both lead to feature absence. The windows enclosed in orange boxes are examples of small windows, which also lead to feature absence.

Figure 14. Three main types of extracted façades. (a) Classical types of completely correct detections, which are similar to reality. The left image shows the ideal case. Other objects are partially blended because of occlusion (middle), and part of the façade is not included in the regional feature because the non-obvious window features cause occlusion (right); (b) Several types of non-integrity. Deficiencies or splitting are observed due to non-integrity of the wall-roof edge (left), and the bottom façade is missing due to the failure to detect the window regional features (right); (c) Main types of missing façades. Arc-shaped façades (left) and a non-distinctive edge between the wall and the rooftop (middle) are observed. Nearly all of the window features are not extracted from a façade (right).

Figure 15. Non-unique locations of wall-roof edges because of the selection rules.

Figure 16. Selection and matching of seed façades based on regional descriptions after the outliers were removed using RANSAC. The blue and white circles denote the candidate seed façades based on the regional description. The blue circles are the final seed façades, and the white circles are considered to be outliers and were removed by the RANSAC algorithm. The other nodes are the façades that have not yet been matched.

Figure 17. Results of the façade regional feature matching based on mutual geometric relationships with the seed façades. The correspondences are linked with lines.

Figure 18. Corresponding windows linked by lines.

Figure 19. The final results of matching X-corners connected by lines.

Table 1. Façade statistics of the corresponding window regional feature extraction ratio.

Table 2. Quantitative assessment of the proposed façade regional feature detection method.

Table 3. Confusion matrix and common performance metrics calculated from façade matching.

Table 4. Confusion matrix and common performance metrics calculated from window matching.

Table 5. Statistics of vectors fitted by X-corners and intersections, respectively.