Matching Aerial Images to 3D Building Models Using Context-Based Geometric Hashing

A city is a dynamic entity, which environment is continuously changing over time. Accordingly, its virtual city models also need to be regularly updated to support accurate model-based decisions for various applications, including urban planning, emergency response and autonomous navigation. A concept of continuous city modeling is to progressively reconstruct city models by accommodating their changes recognized in spatio-temporal domain, while preserving unchanged structures. A first critical step for continuous city modeling is to coherently register remotely sensed data taken at different epochs with existing building models. This paper presents a new model-to-image registration method using a context-based geometric hashing (CGH) method to align a single image with existing 3D building models. This model-to-image registration process consists of three steps: (1) feature extraction; (2) similarity measure; and matching, and (3) estimating exterior orientation parameters (EOPs) of a single image. For feature extraction, we propose two types of matching cues: edged corner features representing the saliency of building corner points with associated edges, and contextual relations among the edged corner features within an individual roof. A set of matched corners are found with given proximity measure through geometric hashing, and optimal matches are then finally determined by maximizing the matching cost encoding contextual similarity between matching candidates. Final matched corners are used for adjusting EOPs of the single airborne image by the least square method based on collinearity equations. The result shows that acceptable accuracy of EOPs of a single image can be achievable using the proposed registration approach as an alternative to a labor-intensive manual registration process.


Introduction
In recent years, a number of mega-cities such as New York and Toronto have built-up detailed 3D city models to support the decision-making process for smart city applications. These 3D models are usually static snapshots of the environment and represent the status quo at the time of their data acquisition. However, cities are dynamic systems that continuously change over time. Accordingly, their virtual representations need to be regularly updated in a timely manner in order to allow for accurate analysis and simulation results that decisions are based upon. In this context, a framework for continuous city modeling by integrating multiple data sources was proposed by [1].
A fundamental step to facilitate this task is to coherently register remotely sensed data taken at different epochs with existing 3D building models. Great research efforts have already been undertaken extracted incompletely, and inaccurately so that ideal edges might be broken into two or more small segments that are not connected to each other. Secondly, there is no strong disambiguating geometric constraint, whereas building models are reconstructed with certain regularities such as orthogonality, and parallelism.
Utilizing prior knowledge of building structures can reduce the matching ambiguities, and the search space. 3D object recognition method from single image based on the notion of perceptual grouping, which groups image lines based on proximity, parallelism and collinearity relations, was proposed in [21]. Also, hidden lines of objects, which do not appear in the image, were eliminated through visibility analysis to reduce search space and to increase the robustness of matching process. In [22], the work of [21] was extended to increase the robustness of the matching by devising a rule-based grouping method. However, a shortcoming of the approach is that matching fails in cases of nadir images where building walls and footprints are invisible in the image. Similar method was used to match 3D building models to aerial images by [23]. However, their implementation was only tested for a small number of buildings and it was limited to standard gable roof models. Also, a requirement of their approach was that each pixel must be within some range of an edge. 2D orthogonal corner (2DOC) was used in [24] as a feature to recover the camera pose for texture mapping of 3D building model. The coarse camera parameters were determined by vertical vanishing points that correspond to vertical lines in the 3D models. Correspondences between image 2DOC and DSM 2DOC were determined using Hough transform, and generalized M-estimator sample consensus. However, they described their error source as too limited to correct 2DOCs matches, in particular, for residential areas. Instead of using 2DOC, Wang et al. [25] proposed three connected segments (3CS) as a feature, which is more distinctive, and repeatable. For putative feature matches, they applied a two level RANSAC method, which consists of a local, and a global RANSAC for robust matching. Figure 1 illustrates the proposed method for registering a single image with existing 3D building models using extracted edged corner features. It starts by back-projecting the 3D building models to the image using initial (or at later steps updated) EOPs. Then with the help of the similarity measure, the matching process finds corresponding features using a CGH method. Based on the matched feature pairs, the EOPs of the single image are estimated by a least square adjustment. As shown in Figure 1, the second and third steps are conducted iteratively to find optimal EOPs until the corresponding matching pairs do not further improve. The three steps of the proposed method are further discussed in the following sub-sections whereat the last two steps are discussed together.

Registration Method
Sensors 2016, 16, 932 4 of 20 no strong disambiguating geometric constraint, whereas building models are reconstructed with certain regularities such as orthogonality, and parallelism. Utilizing prior knowledge of building structures can reduce the matching ambiguities, and the search space. 3D object recognition method from single image based on the notion of perceptual grouping, which groups image lines based on proximity, parallelism and collinearity relations, was proposed in [21]. Also, hidden lines of objects, which do not appear in the image, were eliminated through visibility analysis to reduce search space and to increase the robustness of matching process. In [22], the work of [21] was extended to increase the robustness of the matching by devising a rulebased grouping method. However, a shortcoming of the approach is that matching fails in cases of nadir images where building walls and footprints are invisible in the image. Similar method was used to match 3D building models to aerial images by [23]. However, their implementation was only tested for a small number of buildings and it was limited to standard gable roof models. Also, a requirement of their approach was that each pixel must be within some range of an edge. 2D orthogonal corner (2DOC) was used in [24] as a feature to recover the camera pose for texture mapping of 3D building model. The coarse camera parameters were determined by vertical vanishing points that correspond to vertical lines in the 3D models. Correspondences between image 2DOC and DSM 2DOC were determined using Hough transform, and generalized M-estimator sample consensus. However, they described their error source as too limited to correct 2DOCs matches, in particular, for residential areas. Instead of using 2DOC, Wang et al. [25] proposed three connected segments (3CS) as a feature, which is more distinctive, and repeatable. For putative feature matches, they applied a two level RANSAC method, which consists of a local, and a global RANSAC for robust matching. Figure 1 illustrates the proposed method for registering a single image with existing 3D building models using extracted edged corner features. It starts by back-projecting the 3D building models to the image using initial (or at later steps updated) EOPs. Then with the help of the similarity measure, the matching process finds corresponding features using a CGH method. Based on the matched feature pairs, the EOPs of the single image are estimated by a least square adjustment. As shown in Figure 1, the second and third steps are conducted iteratively to find optimal EOPs until the corresponding matching pairs do not further improve. The three steps of the proposed method are further discussed in the following sub-sections whereat the last two steps are discussed together.

Feature Extraction
Feature extraction is the first step of the registration task. As previously mentioned, feature selection should consider the properties of the given datasets, the application, and the required accuracy. In this study, we use two different types of features: edged corner features, and context features. An edged corner feature, which consists of a corner point, and the two associated lines that potentially intersect at this point ("arms"), provides local structure information for a building. In building models, it is relatively straightforward to extract this feature because each vertex of a building polygon can be treated as a corner and the connected lines as arms. Note that only rooftop polygons are considered for this. In an image with rich texture information, various corner detectors, and line detectors can be used to extract edged corner features. A context feature is defined as a characteristic spatial relation between two edged corner features selected within an individual roof. This context feature is used to represent global structure information so that more accurate, and robust matching results can be achieved. Section 2.1.1 explains the extraction of edged corner features from an image, and Section 2.1.2 describes the properties of context features.

Edged Corner Feature Extraction from Image
Edged corner features from a single image are extracted by three separate steps; (1) extraction of straight lines; (2) extraction of corners, and their arms; and (3) verification. The process starts with the extraction of straight lines from a single image by applying a straight line detector. We use Kovesi's algorithm, which relies on the calculation of phase congruency to localize, and link edges [26]. Then, corners are extracted by estimating the intersection of the extracted straight lines, considering the proximity with a given distance threshold (T d " 20 pixels). Afterwards, corner arms are determined by two straight lines used to extract the corner with fixed length (20 pixels). This procedure may produce incorrect corners because the proximity constraint is the only one considered. Thus, the verification process removes incorrectly extracted corners based on geometric and radiometric constraints. As a geometric constraint, the inner angle between two corner arms is calculated, and investigated to remove corners with sharp inner angles. In general, many of building structures appears in regular shapes following orthogonality and parallelism where small acute angles are found to be uncommon. Through this process, incorrectly extracted corners are filtered out by applying a user-defined inner angle threshold (T θ " 10˝). For the radiometric constraint, we analyze the radiometric values (Digital Number (DN) value or color value) of the left, and right flanking regions (F L 1 , F R 1 , F L 2 , F R 2 ) of each corner arm with a flanking width (ε) as used in [27]. Figure 2 shows a configuration of a corner, its arms, and the concept of the flanking regions. In a correctly extracted corner, the average DN (or color) difference between F L 1 and F R 2 , F L , should be large enough to underline the heterogeneity of two regions. Thus, we measure two radiometric properties: the minimum average DN difference value of two neighbor flanking regions for homogeneity measurement, ˘, and the maximum DN difference value of two opposite flanking regions for heterogeneity measurement, D hetero

˘.
A corner is considered an edged corner feature if the corner has a smaller D homo min than a threshold T homo , and if it has a larger D hetero max than a threshold T hetero . In order to determine thresholds for two radiometric properties, we assume that the intersection points are generated from both correct corners, and incorrect corners; and the two types of intersection points have different distributions with regards to their radiometric properties. Because there are two cases (correct corner and incorrect corner) for the average DN difference values, we can use the Otsu's binarization method [28] to automatically determine an appropriate threshold value. The method was originally designed to extract an object from its background for binary image segmentation based on histogram distribution. It calculates the optimum threshold by separating the two classes (foreground and background) in such a way that their intra-class variance is minimal. In our study, a histogram of homogeneity values (or heterogeneity values) for the entire selection of points is generated, and the optimal threshold for homogeneity (or heterogeneity) is automatically determined by Otsu's binarization method.
flanking regions ( , , , ) of each corner arm with a flanking width (ε) as used in [27]. Figure 2 shows a configuration of a corner, its arms, and the concept of the flanking regions. In a correctly extracted corner, the average DN (or color) difference between and , ‖ − ‖, or between and , ‖ − ‖, is likely to be small, underlining the homogeneity of two regions, while average DN difference between and , ‖ − ‖, or between and , ‖ − ‖, should be large enough to underline the heterogeneity of two regions. Thus, we measure two radiometric properties: the minimum average DN difference value of two neighbor flanking regions for homogeneity measurement, = (‖ − ‖, ‖ − ‖), and the maximum DN difference value of two opposite flanking regions for heterogeneity measurement, = (‖ − ‖, ‖ − ‖). A corner is considered an edged corner feature if the corner has a smaller than a threshold , and if it has a larger than a threshold . Figure 2. Edged corner feature (corner and its arms) and flanking regions.

Context Features
While an edged corner feature provides only local structure information about a building corner, context features partly impart global structure information related to the building configuration. Context features are set by selecting any two adjacent edged corner features, that is, four angles (θ ) as shown in Figure 3. Note that each angle is determined by the relative line connecting any two corners (l). The context feature, which is invariant under scale, translation, and rotation, is used to calculate contextual similarity in our proposed score function (see Section 2.2.2). In order to determine thresholds for two radiometric properties, we assume that the intersection points are generated from both correct corners, and incorrect corners; and the two types of intersection points have different distributions with regards to their radiometric properties. Because there are two cases (correct corner and incorrect corner) for the average DN difference values, we can use the Otsu's binarization method [28] to automatically determine an appropriate threshold value. The method was originally designed to extract an object from its background for binary image segmentation based on histogram distribution. It calculates the optimum threshold by separating the two classes (foreground and background) in such a way that their intra-class variance is minimal. In our study, a histogram of homogeneity values (or heterogeneity values) for the entire selection of points is generated, and the optimal threshold for homogeneity (or heterogeneity) is automatically determined by Otsu's binarization method.

Context Features
While an edged corner feature provides only local structure information about a building corner, context features partly impart global structure information related to the building configuration. Context features are set by selecting any two adjacent edged corner features, that is, four angles ( , , , ) between a line (l) connecting the two corners ( and ), and their arms ( , , , ) as shown in Figure 3. Note that each angle is determined by the relative line connecting any two corners (l). The context feature, which is invariant under scale, translation, and rotation, is used to calculate contextual similarity in our proposed score function (see Section 2.2.2).

Similarity Measurement and Matching
Similarity measurement and matching process take place in the image space after the 3D building models are back-projected onto the image space using the collinearity equations with the initial EOPs (or updated EOPs). In order to find reliable and accurate correspondences between features extracted from a single image, and building models, we introduce a CGH method where the vote counting scheme of a standard geometric hashing is supplemented by a newly developed similarity score function. The similarity score function consists of a unary term, and a contextual term. The unary term measures the similarity between edged corner features derived from the image, and models while the contextual term measures the geometric property of context features. In the following sections, the standard geometric hashing, and its limitations are described (Section 2.2.1), and our proposed CGH method is introduced (Section 2.2.2).

Geometric Hashing
Geometric hashing, a well-known indexing-based approach, is a model-based object recognition technique for retrieving objects in scenes from a constructed database [29]. In geometric hashing, an object is represented as a set of geometric features such as points, and lines, and by its geometric

Similarity Measurement and Matching
Similarity measurement and matching process take place in the image space after the 3D building models are back-projected onto the image space using the collinearity equations with the initial EOPs (or updated EOPs). In order to find reliable and accurate correspondences between features extracted from a single image, and building models, we introduce a CGH method where the vote counting scheme of a standard geometric hashing is supplemented by a newly developed similarity score function. The similarity score function consists of a unary term, and a contextual term. The unary term measures the similarity between edged corner features derived from the image, and models while the contextual term measures the geometric property of context features. In the following sections, the standard geometric hashing, and its limitations are described (Section 2.2.1), and our proposed CGH method is introduced (Section 2.2.2).

Geometric Hashing
Geometric hashing, a well-known indexing-based approach, is a model-based object recognition technique for retrieving objects in scenes from a constructed database [29]. In geometric hashing, an object is represented as a set of geometric features such as points, and lines, and by its geometric relations, which are transformation-invariant under certain transformations. Since only local invariant geometric features are used, geometric hashing can handle partly occluded objects. Geometric hashing consists of two main stages: the pre-processing stage, and the recognition stage. The pre-processing stage encodes the representation of the objects in a database, and stores them in a hash table. Given a set of object points (p k ; k " 0, . . . , n), a pair of points (p i and p j ) is selected as a base pair (Figure 4a). The base pair is scaled, rotated, and translated into the reference frame. In the reference frame, the magnitude of the base pair equals 1; the midpoint between p i and p j is placed at the origin of the reference frame. The vector´Ý In the subsequent recognition stage, the invariants, which are derived from geometric features in a scene, are used as indexing keys to assess the previously constructed hash table so that they can be matched with the stored models. In a similar way to the preprocessing stage, two points from a set of points in the scene are selected as the base pair. The remaining points are mapped to the hash table, and all entries in the corresponding hash table bin receive a vote. Correspondences are determined by a vote counting scheme, producing candidate matches.
Although geometric hashing can solve matching problems of rotated, translated, and partly occluded objects, it has some limitations. The first limitation is that the method is sensitive to the bin size used for quantization of the hash table. While a large bin size in the hash table cannot separate between two close points, a small bin size cannot deal with the position error of the point. Secondly, geometric hashing can produce redundant solutions due to its vote counting scheme [29]. Although it can significantly reduce candidate hypotheses, a verification step or additional fine matching step is required to find optimal matches. Thirdly, geometric hashing has a weakness in cases where the scene contains many features of similar shapes at different scales, and rotations. Without any constraints (e.g., position, scale and rotation) based on prior knowledge about the model, geometric hashing may produce incorrect matches due to the matching ambiguity. Fourthly, the complexity of processing increases by the number of base pairs, and the number of features in the scene [6]. To address these limitations, we enhance the standard geometric hashing by changing the vote counting scheme to a score function, and by adding several constraints such as scale difference of a base, and specific selection of bases.

Context-Based Geometric Hashing (CGH)
In this section, we describe the building model objects and the scene by sets of edged corner features. Edged corner features derived from input building models are used to construct the hash table in the pre-processing stage while edged corner features derived from the single image are used in the recognition stage. Each given building model consists of several planes. Thus, in the pre- In the subsequent recognition stage, the invariants, which are derived from geometric features in a scene, are used as indexing keys to assess the previously constructed hash table so that they can be matched with the stored models. In a similar way to the preprocessing stage, two points from a set of points in the scene are selected as the base pair. The remaining points are mapped to the hash table, and all entries in the corresponding hash table bin receive a vote. Correspondences are determined by a vote counting scheme, producing candidate matches.
Although geometric hashing can solve matching problems of rotated, translated, and partly occluded objects, it has some limitations. The first limitation is that the method is sensitive to the bin size used for quantization of the hash table. While a large bin size in the hash table cannot separate between two close points, a small bin size cannot deal with the position error of the point. Secondly, geometric hashing can produce redundant solutions due to its vote counting scheme [29]. Although it can significantly reduce candidate hypotheses, a verification step or additional fine matching step is required to find optimal matches. Thirdly, geometric hashing has a weakness in cases where the scene contains many features of similar shapes at different scales, and rotations. Without any constraints (e.g., position, scale and rotation) based on prior knowledge about the model, geometric hashing may produce incorrect matches due to the matching ambiguity. Fourthly, the complexity of processing increases by the number of base pairs, and the number of features in the scene [6]. To address these limitations, we enhance the standard geometric hashing by changing the vote counting scheme to a score function, and by adding several constraints such as scale difference of a base, and specific selection of bases.

Context-Based Geometric Hashing (CGH)
In this section, we describe the building model objects and the scene by sets of edged corner features. Edged corner features derived from input building models are used to construct the hash table in the pre-processing stage while edged corner features derived from the single image are used in the recognition stage. Each given building model consists of several planes. Thus, in the pre-processing stage, we select two edged corner features, which belong to the same plane of the building model as the base pair. It can reduce the complexity of the hashing table, and ensures that the base pair retains the spatial information of the plane. The selected base pair is scaled, rotated, and translated to define the reference frame. The remaining edged corner features which belong to the whole building model are also transformed with the base pair. In contrast to the standard geometric hashing, our hashing table contains model IDs, feature IDs of the base pair, the scale of the base pair (the rate of real distance of base pair), an index for member edged corner features, and context features generated by combinations with edged corner features. Figure 5 shows an example of the information to be stored in a hashing table.
Sensors 2016, 16,932 8 of 20 processing stage, we select two edged corner features, which belong to the same plane of the building model as the base pair. It can reduce the complexity of the hashing table, and ensures that the base pair retains the spatial information of the plane. The selected base pair is scaled, rotated, and translated to define the reference frame. The remaining edged corner features which belong to the whole building model are also transformed with the base pair. In contrast to the standard geometric hashing, our hashing table contains model IDs, feature IDs of the base pair, the scale of the base pair (the rate of real distance of base pair), an index for member edged corner features, and context features generated by combinations with edged corner features. Figure 5 shows an example of the information to be stored in a hashing table. Once all possible base pairs are set, the recognition stage tries to retrieve corresponding features based on the designed score function. Two edged corner features from the image are selected as base pair with two constraints: (1) scale constraint; and (2) position constraint. As a constraint on a scale, only those base pairs whose scale is similar to the scale of the base pair in the hash table are considered with an assumption that the initial EOPs provide an approximate scale of the image. Thus, if the scale ratio is smaller than a user defined threshold ( = 0.98), the base pair is excluded from the set of possible base pairs. In addition to scale constraint, the possible positions of a base pair can be also restricted with a proper searching space. This searching space can be determined by calculating error propagation with the amount of assumed errors (calculated by the iterative process) for initial EOPs (updated EOPs) of the image, and the models. These two constraints reduce the matching ambiguity, and the complexity of processing. After the selection of possible base pairs from the image, all remaining edged corner features in the image are transformed based on a selected base pair. Afterwards, the optimal matches are determined by comparing a similarity score. The process starts by generating context features from the model, and the image in a reference frame. Given a model that consists of five edged corner features (black color), ten context features can be generated as shown in Figure 6. Note that all edged corner features derived from the model are not matched with edged corner features derived from the image (red color). Thus, only edged corner features, which have corresponding image edged corner features within the search area (n = 4 in Figure 6), and their corresponding context features (m = 6 in Figure 6 (red long-dash)) are considered in the calculation of the similarity score function.  Once all possible base pairs are set, the recognition stage tries to retrieve corresponding features based on the designed score function. Two edged corner features from the image are selected as base pair with two constraints: (1) scale constraint; and (2) position constraint. As a constraint on a scale, only those base pairs whose scale is similar to the scale of the base pair in the hash table are considered with an assumption that the initial EOPs provide an approximate scale of the image. Thus, if the scale ratio is smaller than a user defined threshold (T s " 0.98), the base pair is excluded from the set of possible base pairs. In addition to scale constraint, the possible positions of a base pair can be also restricted with a proper searching space. This searching space can be determined by calculating error propagation with the amount of assumed errors (calculated by the iterative process) for initial EOPs (updated EOPs) of the image, and the models. These two constraints reduce the matching ambiguity, and the complexity of processing. After the selection of possible base pairs from the image, all remaining edged corner features in the image are transformed based on a selected base pair. Afterwards, the optimal matches are determined by comparing a similarity score. The process starts by generating context features from the model, and the image in a reference frame. Given a model that consists of five edged corner features (black color), ten context features can be generated as shown in Figure 6. Note that all edged corner features derived from the model are not matched with edged corner features derived from the image (red color). Thus, only edged corner features, which have corresponding image edged corner features within the search area (n = 4 in Figure 6), and their corresponding context features (m = 6 in Figure 6 (red long-dash)) are considered in the calculation of the similarity score function. processing stage, we select two edged corner features, which belong to the same plane of the building model as the base pair. It can reduce the complexity of the hashing table, and ensures that the base pair retains the spatial information of the plane. The selected base pair is scaled, rotated, and translated to define the reference frame. The remaining edged corner features which belong to the whole building model are also transformed with the base pair. In contrast to the standard geometric hashing, our hashing table contains model IDs, feature IDs of the base pair, the scale of the base pair (the rate of real distance of base pair), an index for member edged corner features, and context features generated by combinations with edged corner features. Figure 5 shows an example of the information to be stored in a hashing table. Once all possible base pairs are set, the recognition stage tries to retrieve corresponding features based on the designed score function. Two edged corner features from the image are selected as base pair with two constraints: (1) scale constraint; and (2) position constraint. As a constraint on a scale, only those base pairs whose scale is similar to the scale of the base pair in the hash table are considered with an assumption that the initial EOPs provide an approximate scale of the image. Thus, if the scale ratio is smaller than a user defined threshold ( = 0.98), the base pair is excluded from the set of possible base pairs. In addition to scale constraint, the possible positions of a base pair can be also restricted with a proper searching space. This searching space can be determined by calculating error propagation with the amount of assumed errors (calculated by the iterative process) for initial EOPs (updated EOPs) of the image, and the models. These two constraints reduce the matching ambiguity, and the complexity of processing. After the selection of possible base pairs from the image, all remaining edged corner features in the image are transformed based on a selected base pair. Afterwards, the optimal matches are determined by comparing a similarity score. The process starts by generating context features from the model, and the image in a reference frame. Given a model that consists of five edged corner features (black color), ten context features can be generated as shown in Figure 6. Note that all edged corner features derived from the model are not matched with edged corner features derived from the image (red color). Thus, only edged corner features, which have corresponding image edged corner features within the search area (n = 4 in Figure 6), and their corresponding context features (m = 6 in Figure 6 (red long-dash)) are considered in the calculation of the similarity score function.   The newly designed score function consists of a unary term, which measures the position differences of the matched points, and a contextual term, which measures length and angle differences of corresponding context features, as follows: score " αˆ«wˆř n i"1 U piq n`p 1´wqˆř where: α is an indicator function where the minimum number of features to be matched is determined depending on T c (T c " 0.5, at least 50% of corners in the model should be matched with corners from the image) so that all features of the model do not need to be detected in the image; n and m are the number of matched edged corner features, and context features, respectively; w is a weight value which balances the unary term and the contextual term; in our case, w = 0.5 is heuristically selected: Unary term: The unary term U piq measures the position distance between edged corner features derived from the model, and the image in a reference frame. The position difference P M i´P I i between an edged corner feature in the model and its corresponding feature in the image is normalized by the distance N P i calculated by the EOP error propagation on the image plane: Contextual term: This term is designed to measure the similarity between context features in terms of length and four angles. The contextual term is calculated for all context features which are generated from matched edged corner features. For the length difference, L M ij´L I ij , the difference between lengths of context features in the model, and in the image is normalized by length N L ij of the context feature in the model. For angle differences, the angle difference θ M k ij´θ I k ij between the inner angles of a context feature is normalized by N θ ij (N θ ij " π 2 ): For each model, a base pair, and its corresponding corners which maximize the score function are selected as optimal matches. Note that if the maximum score is smaller than a certain threshold T m , the matches are not considered as matched corners. The role of T m is to determine an optimal subset of accurate matching correspondences for estimating EOP parameters. High T m values provide a low number of matching correspondences with high accuracy. In contrast, low T m values increase the number of matching correspondences but they also decrease their accuracy. Once all correspondences are determined, the EOPs of the image are adjusted through space resection using pairs of object coordinates of the existing building models, and newly derived image coordinates from the matching process. Values calculated from the similarity score function are used to weight matched pairs. The process continues until matched pairs do not change.

Experimental Results
The proposed CGH-based registration method was tested on benchmark datasets over the downtown areas in Toronto (ON, Canada) and Vaihingen in Germany provided by the ISPRS Commission III, WG3/4 [30]. Table 1 shows characteristics of reference building models, which were used to determine EOPs. For the Toronto datasets, two different types of reference building models were prepared by: (1) a manual digitization process conducted by human operators; and (2) using a state-of-the art algorithm [31] from airborne LiDAR point clouds. These two building models were used to investigate their respective effects on the performance of our method (Figure 7). For the Vaihingen datasets, LiDAR-driven building models were automatically generated by [32] and adjusted as described in [33] as shown in Figure 8. A total of 16 check points for each dataset, which were evenly distributed throughout the images, were used to evaluate the accuracy of the EOPs. adjusted as described in [33] as shown in Figure 8. A total of 16 check points for each dataset, which were evenly distributed throughout the images, were used to evaluate the accuracy of the EOPs.  For the Toronto dataset, various analyses were conducted to evaluate the performance of the proposed registration method in detail. From the aerial image, a total of 90,951 straight lines were extracted and 258,486 intersection points were derived by intersecting any two straight lines found adjusted as described in [33] as shown in Figure 8. A total of 16 check points for each dataset, which were evenly distributed throughout the images, were used to evaluate the accuracy of the EOPs.  For the Toronto dataset, various analyses were conducted to evaluate the performance of the proposed registration method in detail. From the aerial image, a total of 90,951 straight lines were extracted and 258,486 intersection points were derived by intersecting any two straight lines found For the Toronto dataset, various analyses were conducted to evaluate the performance of the proposed registration method in detail. From the aerial image, a total of 90,951 straight lines were extracted and 258,486 intersection points were derived by intersecting any two straight lines found within 20 pixels of proximity constraint. Out of these, 57,767 intersection points were selected as edged corner features following the removal of 15%, and 60% of intersection points using geometric constraint (T θ " 10˝), and radiometric constraints (T homo " 26, and T hetero " 55), respectively ( Table 2). The T homo and T hetero were automatically determined by Otsu's binarization method. Figure 9 shows edged corner features extracted from the aerial image. As many of the intersection points are not likely to be corners, the majority of them were removed. The method correctly detected corners and arms in most cases even though some corners were visually difficult to detect due to their low contrasts. within 20 pixels of proximity constraint. Out of these, 57,767 intersection points were selected as edged corner features following the removal of 15%, and 60% of intersection points using geometric constraint ( = 10°), and radiometric constraints ( = 26, and = 55), respectively ( Table 2).
The and were automatically determined by Otsu's binarization method. Figure 9 shows edged corner features extracted from the aerial image. As many of the intersection points are not likely to be corners, the majority of them were removed. The method correctly detected corners and arms in most cases even though some corners were visually difficult to detect due to their low contrasts.  After the existing building models were back-projected onto the image using error-contained EOPs, edged corner features were extracted from the vertices of the building models in the image space ( Figure 10). It should be noted that two different datasets were used as the existing building models. Some edged corner features extracted from both existing building models were not observed in the image due to occlusions caused by neighbor building planes. Also, some edged corner features, in particular those extracted from LiDAR-driven building models, do not match with the edged corner features extracted from the image due to modeling errors caused by irregular point distribution, occlusions and the reconstruction mechanism. Thus, correspondences between edged corner features from the image and from the existing building models are likely to be partly established.
The proposed CGH method was applied to find correspondences between features derived from the image and from existing building models. When manually digitized building models are used as building models, a total of 693 edged corner features (7.8% of edged corner features extracted from the entire building models) were matched using the parameters given in Table 3.  After the existing building models were back-projected onto the image using error-contained EOPs, edged corner features were extracted from the vertices of the building models in the image space ( Figure 10). It should be noted that two different datasets were used as the existing building models. Some edged corner features extracted from both existing building models were not observed in the image due to occlusions caused by neighbor building planes. Also, some edged corner features, in particular those extracted from LiDAR-driven building models, do not match with the edged corner features extracted from the image due to modeling errors caused by irregular point distribution, occlusions and the reconstruction mechanism. Thus, correspondences between edged corner features from the image and from the existing building models are likely to be partly established.
The proposed CGH method was applied to find correspondences between features derived from the image and from existing building models. When manually digitized building models are used as building models, a total of 693 edged corner features (7.8% of edged corner features extracted from the entire building models) were matched using the parameters given in Table 3. It is noted that only models whose vertices were greater than were considered to find possible building matches. For LiDAR-driven building models, only 381 edged corner features (4.9% from the entire building models) were matched ( Table 2). It is noted that the number of matched edged corner features is influenced by the quality of the existing building models, and thresholds used, in particular. As shown in Table 2, more edged corner features are matched when manually digitized building models were used as the existing building models than when LiDAR-driven building models were used. If is set as a small value, the number of matched edged corner features increases, but this increases the risk it may contain a large number of incorrect matched edged corner features. The effect on the will be discussed in detail later. Based on matched edged corner features, EOPs for the image were calculated by applying the least square method based on co-linearity equations. For qualitative assessment, the existing models were back-projected to the image with refined EOPs. Each column of Figures 11 and 12 shows backprojected building models with error-contained EOPs (a), matched edged corner features (b), and back-projected building models with refined EOPs (c). In the figures, boundaries of the existing building models are well matched to building boundaries in the image with refined EOPs.
In our quantitative evaluation, we assessed the root mean square error (RMSE) of check points back-projected onto the image space using refined EOPs (Table 4). When reference building models were used as the existing building models, the results show that the average difference in x and y directions are −0.27 and 0.33 pixels, respectively, with RMSE of ±0.68 and ±0.71 pixels, respectively. The results with LiDAR-driven buildings models show that the average differences in x and y directions are −1.03 and 1.93 pixels, with RMSE of ±0.95 and ±0.89 pixels, respectively. Although LiDAR-driven building models are used, the accuracy of the EOPs is less than 2 pixels in image space (approximately 30 cm in ground sample distance (GSD)). Considering that the point space (resolution) of the input airborne LiDAR dataset is larger than 0.3 m, the refined EOPs provide a greater accuracy for engineering applications. It is noted that only models whose vertices were greater than T c were considered to find possible building matches. For LiDAR-driven building models, only 381 edged corner features (4.9% from the entire building models) were matched ( Table 2). It is noted that the number of matched edged corner features is influenced by the quality of the existing building models, and thresholds used, T m in particular. As shown in Table 2, more edged corner features are matched when manually digitized building models were used as the existing building models than when LiDAR-driven building models were used. If T m is set as a small value, the number of matched edged corner features increases, but this increases the risk it may contain a large number of incorrect matched edged corner features. The effect on the T m will be discussed in detail later. Based on matched edged corner features, EOPs for the image were calculated by applying the least square method based on co-linearity equations. For qualitative assessment, the existing models were back-projected to the image with refined EOPs. Each column of Figures 11 and 12 shows back-projected building models with error-contained EOPs (a), matched edged corner features (b), and back-projected building models with refined EOPs (c). In the figures, boundaries of the existing building models are well matched to building boundaries in the image with refined EOPs.
In our quantitative evaluation, we assessed the root mean square error (RMSE) of check points back-projected onto the image space using refined EOPs (Table 4). When reference building models were used as the existing building models, the results show that the average difference in x and y directions are´0.27 and 0.33 pixels, respectively, with RMSE of˘0.68 and˘0.71 pixels, respectively. The results with LiDAR-driven buildings models show that the average differences in x and y directions are´1.03 and 1.93 pixels, with RMSE of˘0.95 and˘0.89 pixels, respectively. Although LiDAR-driven building models are used, the accuracy of the EOPs is less than 2 pixels in image space (approximately 30 cm in ground sample distance (GSD)). Considering that the point space (resolution) of the input airborne LiDAR dataset is larger than 0.3 m, the refined EOPs provide a greater accuracy for engineering applications.     The error distribution of 16 check points is illustrated in Figure 13. The error distributions showed that the interquartile range (IQR) for both manually digitized and LiDAR-driven building models were under 1.5 pixels. The maximum error value for LiDAR-driven models was however 1 pixel greater than for manually digitized models. The error distribution of 16 check points is illustrated in Figure 13. The error distributions showed that the interquartile range (IQR) for both manually digitized and LiDAR-driven building models were under 1.5 pixels. The maximum error value for LiDAR-driven models was however 1 pixel greater than for manually digitized models. In this study, threshold, has an effect on the accuracy of the EOPs. In order to evaluate the effect of , we estimated the matched number of edged corner features, and calculated the average error and the RMSE of the check points with different values of . As shown in Table 5, the number of matched features is inversely proportional to the value of , regardless of which existing building models are used. However, the effect of on the accuracy is not the same for both building models. We observed affects the matching accuracy of digitized building models less than it does for LiDAR-driven building models. Furthermore, the matching accuracy tends to get worse with very low or high values. The latter can be explained by the low number of matched features, giving us insufficient data to accurately adjust the EOPs of the image. In the other case, if a low value is selected, the number of matched features increases, but so does the number of incorrect matches if the building models are inaccurate. Thus, we can observe that LiDAR-driven building models, reconstructed with relatively lower accuracy compared to the manually digitized models, produced more sensitive results in the matching accuracy according to . In contrast, the matching accuracy of the manually digitized building models remains high because of high model accuracy. In summary, a higher accuracy of the building models can lead to a higher EOP accuracy, while the value of should be determined by balancing the ratio of correct matched features and incorrect matched features. In order to evaluate the effect on context feature, we set weight parameter w in score function (Equation (1)) as 1 and 0.5, respectively, and then compared the results. When w = 1, the score function considers only the unary term without the effect of the contextual term so that the contextual force is ignored. As shown in Table 6, the results show that registration with only unary terms causes In this study, threshold, T m has an effect on the accuracy of the EOPs. In order to evaluate the effect of T m , we estimated the matched number of edged corner features, and calculated the average error and the RMSE of the check points with different values of T m . As shown in Table 5, the number of matched features is inversely proportional to the value of T m , regardless of which existing building models are used. However, the effect of T m on the accuracy is not the same for both building models. We observed T m affects the matching accuracy of digitized building models less than it does for LiDAR-driven building models. Furthermore, the matching accuracy tends to get worse with very low or high T m values. The latter can be explained by the low number of matched features, giving us insufficient data to accurately adjust the EOPs of the image. In the other case, if a low T m value is selected, the number of matched features increases, but so does the number of incorrect matches if the building models are inaccurate. Thus, we can observe that LiDAR-driven building models, reconstructed with relatively lower accuracy compared to the manually digitized models, produced more sensitive results in the matching accuracy according to T m . In contrast, the matching accuracy of the manually digitized building models remains high because of high model accuracy. In summary, a higher accuracy of the building models can lead to a higher EOP accuracy, while the value of T m should be determined by balancing the ratio of correct matched features and incorrect matched features. In order to evaluate the effect on context feature, we set weight parameter w in score function (Equation (1)) as 1 and 0.5, respectively, and then compared the results. When w = 1, the score function considers only the unary term without the effect of the contextual term so that the contextual force is ignored. As shown in Table 6, the results show that registration with only unary terms causes considerably low accuracy in both cases. In particular, with LiDAR-driven models, the accuracy is heavily affected. These results indicate that the use of context features has a positive effect on resolving the matching ambiguity and thus improving the EOP accuracy by reinforcing contextual force. We also analyzed various impacts of errors in initial EOPs on the matching accuracy by adding different levels of errors to evaluate our proposed method. Each parameter of the EOPs leads to different behaviors from back-projected building models: X 0 and Y 0 parameters are related to the translation of back-projected building models; Z 0 is related to scale; ω 0 and ϕ 0 cause shape distortion; κ 0 is related to rotation ( Figure 14). In order to assess the effects on translation and scale, errors ranging from 0 m to 25 m were added to three position parameters. To assess the shape distortion and rotation effects, errors ranging from 0˝to 2.5˝were added to three rotation parameters. Figure 15 shows the accuracies of the refined EOPs with different level of errors for each EOP parameter. Regardless of errors in the initial EOPs, RMSE of under 2 pixels for manually digitized building models, and RMSE of under 3 pixels for LiDAR-driven building models were achieved. The results indicate that the accuracy of the refined EOPs was less affected by the amount of initial EOPs errors. This is due to the fact that the EOPs converge to the optimum solution iteratively. considerably low accuracy in both cases. In particular, with LiDAR-driven models, the accuracy is heavily affected. These results indicate that the use of context features has a positive effect on resolving the matching ambiguity and thus improving the EOP accuracy by reinforcing contextual force. We also analyzed various impacts of errors in initial EOPs on the matching accuracy by adding different levels of errors to evaluate our proposed method. Each parameter of the EOPs leads to different behaviors from back-projected building models: and parameters are related to the translation of back-projected building models; is related to scale; and cause shape distortion; is related to rotation ( Figure 14). In order to assess the effects on translation and scale, errors ranging from 0 m to 25 m were added to three position parameters. To assess the shape distortion and rotation effects, errors ranging from 0° to 2.5° were added to three rotation parameters. Figure 15 shows the accuracies of the refined EOPs with different level of errors for each EOP parameter. Regardless of errors in the initial EOPs, RMSE of under 2 pixels for manually digitized building models, and RMSE of under 3 pixels for LiDAR-driven building models were achieved. The results indicate that the accuracy of the refined EOPs was less affected by the amount of initial EOPs errors. This is due to the fact that the EOPs converge to the optimum solution iteratively. In order to evaluate the robustness of the proposed registration method, the algorithm was applied to the Vaihingen dataset. A total of 31,072 edged corner features from the image and 11,812 edged corner features from the existing building models were extracted using the parameters set in Table 3. A total of 379 edged corner features were matched by the CGH method where was heuristically set as 0.7, and other parameters were set by Table 3. The results of the extracted and matched features are summarized in Table 7. Sixteen check points were evaluated for error-contained EOPs and refined EOPs. The accuracies of the check points with refined EOPs show that the average difference for x and y directions are 0.67 and 0.91 pixels with RMSE of ±1.25 and ±1.49 pixels respectively (Table 8). A summary of the error distribution for the 16 check points is presented in Figure 16. The results suggest that the proposed registration method can achieve accurate and robust matching results even though building models with different error types were used for the registration of a single image. In order to evaluate the robustness of the proposed registration method, the algorithm was applied to the Vaihingen dataset. A total of 31,072 edged corner features from the image and 11,812 edged corner features from the existing building models were extracted using the parameters set in Table 3. A total of 379 edged corner features were matched by the CGH method where T m was heuristically set as 0.7, and other parameters were set by Table 3. The results of the extracted and matched features are summarized in Table 7. Sixteen check points were evaluated for error-contained EOPs and refined EOPs. The accuracies of the check points with refined EOPs show that the average difference for x and y directions are 0.67 and 0.91 pixels with RMSE of˘1.25 and˘1.49 pixels respectively (Table 8). A summary of the error distribution for the 16 check points is presented in Figure 16. The results suggest that the proposed registration method can achieve accurate and robust matching results even though building models with different error types were used for the registration of a single image.

Discussion
In this paper, we proposed a new model-to-image registration method which can align a single image with 3D building models. Edged corner features, represented by a corner and its associated edges, and context features are proposed as the matching features. Edged corner features are extracted from the image by calculating the intersection of two neighboring straight lines, and verified using geometric and radiometric properties. For similarity measurement, and matching, the CGH method was proposed to compensate for the limitations of the standard geometric hashing method. The qualitative assessment showed that the boundaries of the existing building models, back-projected by refined EOPs, are well aligned with boundary lines from the image. Meanwhile, the quantitative assessment showed that both manually digitized building models, and LiDARdriven building models can be used to evaluate the EOPs of a single image with acceptable and reliable accuracy. More specifically, experimental results are summarized as follows:


The quality of building models directly affects the accuracy of EOPs. When manually digitized building models were used, the proposed registration method accurately and reliably achieved the EOPs regardless of threshold and assumed error. However, if building models contain more modeling errors, the accuracy of EOPs is reduced, which are more susceptible to threshold, and assumed errors.  Contextual features employed in geometric hashing enhances matching performance. This is because contextual values provide information about the relation between edged corner features, characterizing geometric properties of individual roof polygon. In particular, the use of context features, which provide global information of building models, that is at larger scale (object-level) than at using single corners only (point-level), plays a significant role in our enhanced geometric hashing method, and making our matching performance more robust to errors involved in building models used.  The proposed method can iteratively recover the EOPs of a single image in spite of considerable error in their initial values, which exceed error amounts permitted in commercial aerial image acquisition.
As future work, we will extend the proposed method to arbitrarily acquired images (e.g., UAV images, and security camera images).