Good Practices for Object-Based Accuracy Assessment

Thematic accuracy assessment of a map is a necessary condition for the comparison of research results and the appropriate use of geographic data analysis. Good practices of accuracy assessment already exist, but Geographic Object-Based Image Analysis (GEOBIA) is based on a partition of the spatial area of interest into polygons, which leads to specific issues. In this study, additional guidelines for the validation of object-based maps are provided. These guidelines include recommendations about sampling design, response design and analysis, as well as the evaluation of structural and positional quality. Different types of GEOBIA applications are considered with their specific issues. In particular, accuracy assessment could either focus on the count of spatial entities or on the area of the map that is correctly classified. Two practical examples are given at the end of the manuscript.


Introduction
Accuracy assessment is an acknowledged requirement in the process of creating and distributing thematic maps [1].It is a necessary condition for the comparison of research results and the appropriate use of map products.However, despite the fact that the need to carry out (and document) accuracy assessment is recognized by the scientific community, further attention to rigorous assessment is still needed [2].A general framework of good practices for accuracy assessment has been described in [3,4] in order to guide the validation effort.However, as mentioned in [5], Geographic Object-Based Image Analysis (GEOBIA) validation has its own characteristics.It is, therefore, necessary to adjust some of the good practices of the general framework for the specific quality assessment of GEOBIA results.These practices depend on the type of geospatial database that is going to be extracted from the image analysis.
There are at least three important stages in the assessment process leading to the design of a geographic database: (i) observation, (ii) relating the observation to a conceptual model and (iii) representing the data in formal terms [6].In the first stage, the surveyor must decide whether the observation is a clearly defined or definable entity, or if it is a continuum, i.e., a smoothly-varying surface.The first situation refers to an exact entity, called the discrete entity in Burrough [6] or the spatial object in Bian [7].In this study, they will be named spatial entities.The second situation refers to a (continuous) field, and it is not well represented by GEOBIA [7].However, viewing the world as either a continuum or as a set of discrete entities is always a matter of appreciation.At the level of the empirical, one-to-one scale geographic world, it is indeed not possible to decide if the world is made up of discrete, indivisible elementary entities, or if it is a continuum with different properties at different locations.On one side, houses, trees and humans are the non-subdivisible smallest units for most purposes of geographic-scale modeling and study; on the other side, extensive entities such as oceans, prairies, forest and geological formations may be subdivided within very wide limits and still maintain their identity [8].A third category has therefore been added, which is called spatial region [7] or the object with an indeterminate boundary [6].The term spatial region used in this paper represents a mass of individuals that can be conceptualized both as a continuous field and as discrete entity, which is often the case for land cover.This duality is also found for their coding in a database: they can be discretized as vector polygons with boundaries based on a general agreement (which are represented as adimensional lines, but which are fuzzy in reality), or they can be represented as an arbitrary grid specifying the proportion of each individual entity at each pixel (without boundaries) [7].The three conceptual models (spatial entities, spatial regions and continuous field) are illustrated in Figure 1.The third model (continuous field) will not be discussed further in this paper.GEOBIA is often considered as a new image classification paradigm in remote sensing [9].It encompasses a large set of tools that incorporate the knowledge of a set of (usually) adjacent pixels to derive object-based information.GEOBIA is primarily applied to Very High Resolution (VHR) images, where spatial entities/regions are visually composed of many pixels and where it is possible to visually validate them [9].Most of the time, these groups of pixels (called image-segments in Hay and Castilla [10]) are derived from automated image segmentation (e.g., [11]) or superpixel algorithms [12] before the classification stage.In addition to the frequently-used segmentation/classification workflow, some machine learning algorithms have been developed that recognize groups of pixels as a whole [13], and polygons are sometimes taken from ancillary data (e.g., [14,15]).In any case, GEOBIA relates to the conceptual model of geo-objects, defined as a bounded geographic region that can be identified for a period of time as the referent of a geographic term [16].Geo-objects, which encompass spatial entities or spatial regions, are represented (and analyzed) as points, lines or (most of the time) polygons.GEOBIA then assigns a single set of attributes to each polygon as a whole, in contrast to pixel-based classification, where each individual pixel is classified.The polygons used in GEOBIA are thus intrinsically considered as homogeneous in terms of the label [10,17,18], even when those labels are fuzzy [19].
According to Bian [7], polygons are a good representation for spatial entities and a reasonable one for spatial regions, which might have indeterminate boundaries.Although many geographic-scale and anthropogenic entities have boundaries, the most precise boundaries existing in the geographic world are immaterial: these are administrative and property boundaries [8].At the other extreme are the self-defining boundaries of social, cultural and biological territories.Extracting the boundaries is often part of the GEOBIA process, which then contributes to the discretization process.The quality of the boundaries should therefore most of the time be considered in addition to the thematic accuracy.
There is a wide range of applications of GEOBIA, and good validation practices depend on the selected application.In this study, we will focus on thematic (categorical) geographic database using the polygon representation of spatial entities or regions, as opposed to the pixel-based approach where the support of the analysis is the (sub)pixel captured by a sensor.With those thematic products, where a single category is assigned to each polygon (i.e., thematic maps), the accuracy of the labels assigned to the geo-objects is of primary concern.However, in addition to the thematic accuracy assessment, spatial quality should also be taken into account for GEOBIA [20].The structural and positional quality of the polygons are therefore addressed, as suggested by [5].The comparison of segmentation algorithms before the classification stage is, however, out of the scope of this paper, as this study will consider the quality of the final results, notwithstanding the specific numerical methods that were involved for their production.The proposed framework assesses thematic and geometric errors using distinct indices in order to identify the different sources of uncertainty for the derived product.Synthetic indices mixing the two types of errors (e.g., [21,22]) are therefore not discussed.

General Accuracy Assessment Framework
The general accuracy assessment framework defined in [3,4] is still applicable to most GEOBIA results.The standard practice consists of reporting the map accuracy as an error matrix when the map and reference classifications are based on categories [23].Recommendations about the quality assessment are commonly divided in three components that have the same importance to draw rigorous conclusions and that should be reported to end users, namely the analysis, the response design and the sampling design.
Analysis includes the quality indices to be used to address the objectives of the mapping project, along with the method to estimate those indices [4].Quality indices that are directly interpretable as probabilities of encountering certain types of misclassification errors or correct classifications should be selected in preference to quality indices not interpretable as such [24].In order to derive unbiased estimation of these indices, the probability to select each sampling unit should be considered.
Response design is the protocol for determining the ground condition (reference) classification of a sampled spatial unit (pixel, block or polygon) [25].The fundamental basis of an accuracy assessment is a location-specific comparison between the map classification and the "reference" classification [1].The response design includes the choice of a type of sampling unit and the rules that allow the operator to decide if a sampling unit was correctly labeled (or not).
The sampling design is the way a representative subset of the geospatial database is selected to perform the accuracy assessment.In practice, it is indeed impossible to have precise (low uncertainty) and exact (unbiased) information that exhaustively covers the study area [4,25], except when using synthetic data.A subset of the geo-objects is, therefore, selected in order to estimate the accuracy of the map.The subset should be drawn with a probabilistic method, in order to be representative of the whole study area [23].The sample can be stratified (i) to increase the proportion of samples in error-prone areas in order to reduce the variance of the estimator [26,27] and/or (ii) to balance the number of sampling units for each category in order to avoid large variances of estimation for low frequency classes [4].
As discussed above, GEOBIA can be oriented towards objectives, related with two different conceptual models: spatial regions or spatial entities.Furthermore, GEOBIA can have different goals.In an attempt to provide pragmatic guidelines, four types of GEOBIA applications are considered in this paper:

•
Wall-to-wall mapping: This is the most common application, which results in thematic maps proposing a complete partition of the study area into classified polygons, such as for land cover or land use maps.

•
Entity delineation: The goal is the delineation of selected spatial entities (e.g., buildings [32], vine parcels [33], crop fields [34]) with a focus on the precision of their boundaries.

•
Enhanced pixel classification: GEOBIA is used to improve image classification at the pixel level by reducing the within-class variability (speckle removal) or by computing additional characteristics (texture, structure, context) [35][36][37].
In the following sections, specific GEOBIA issues will be addressed with respect to these four types of applications, when relevant.Section 3 lays focus on the way to analyze the confusion matrix to estimate relevant quality indices.Section 4 is related to the response design and addresses the choice of the sampling unit (pixel or polygon).The methods to create a subset of polygonal sampling units are then described in Section 5. Finally, Section 6 addresses further issues related to the scale of the GEOBIA application, as well as specific geometric errors linked to the segmentation process.

Analysis of Quality Indices
For a pixel-based validation, the proportion of correctly classified pixels (count-based classification accuracy) is equal to the proportion of the area that is correctly classified (area-based map accuracy).In other words, if one knows how many pixels are misclassified, one also knows the area of the map that is misclassified.Indeed, standard validation procedures usually assume that pixels have all the same area (assuming pixels of equal area makes sense at small scale factor, but a rigorous pixel-based quality assessment over large areas should take into account that some coordinate systems do not preserve equal areas).Polygons, however, do not necessarily have the same areas, and their differences in size can be very large in some landscapes (e.g., a landscape including anthropogenic buildings and forests at the same time).In terms of the proportion of the area of the map that is incorrectly classified, an error for a large polygon has obviously more impact than an error for a smaller one.This variable area must be taken into account for an unbiased estimate of some accuracy indices.
Figure 2 illustrates the differences between count-based and area-based accuracy on simplified examples, and it emphasizes the need of selecting indices that are meaningful for the purpose of the study.On one side, count-based accuracy (Section 3.1) should be used for the assessment of spatial entity detection, and on the other side, area-based accuracy (Section 3.2) should be used for spatial region mapping and for the classification of pixels.In the case of spatial entity delineation, the geometric precision is usually more important than the thematic accuracy.Nevertheless, the automated delineation could sometimes misclassify some of the spatial entities.Count-based indices can then be useful to estimate the probability of missing entities.Illustration of the differences between area-and count-based accuracies for geo-objects, where correctly classified cells are in green and incorrect ones are in red (connected cells are part of a single spatial entity).The three synthetic examples have the same area-based accuracy (33% of the surface is misclassified), but different count-based accuracies.On the left, the three geo-objects have the same area, therefore area-based and count-based accuracies are identical.In the center, there is one large and one small geo-object: the size of the misclassified geo-object has an impact on the area-based accuracy, but not on the count-based accuracy.On the right, each geo-object is correctly detected (based on a majority rule), but there are segmentation errors.The area-based accuracy is then smaller than the count-based accuracy.

Accuracy of Spatial Entity Detection
In the case of entity detection, the validation process focuses on correct detection rates.In a binary classification, the response design identifies four possible cases: (i) the True Positives (TP) are the entities that are correctly detected by the method; (ii) the False Positives (FP) are the polygons that are incorrectly detected as entities; (iii) the False Negatives (FN) are entities that are not detected by the method; and (iv) the True Negatives (TN) are the correctly undetected polygons.Indices based on true and false detections are widely used in computer science and related thematic disciplines.However, GEOBIA detection does not solely rely on the classification of entities.Indeed, for polygons that have been classified as a spatial entity, over-segmentation (more than one image-segment for one spatial entity) and under-segmentation (more than one spatial entity encapsulated inside one image-segment) contribute to the false positives and the false negatives, respectively.As a consequence, the number of subparts minus one should, therefore, be counted as false positive, and the number of encapsulated geo-objects minus one should add to the false negative in these situations.
In the binary case, e.g., detection of dead trees in a dense population of trees, Powers [38] suggests three indices to summarize the information of the contingency table that is relevant for the comparison between methods, namely the informedness (Equation (1)), the markedness (Equation (2)) and the Matthews' correlation coefficient (Equation (3)), with: However, in most spatial entities' detection context, TN is indeterminate because it would correspond to the background, which is not of the same nature as the spatial entities and is usually not countable (e.g., cows (spatial entities) in a pasture (spatial region)).In those cases, the count-based user accuracy (also called positive predicted value, Equation ( 4)) and the count-based producer accuracy (also called sensitivity, Equation ( 5)) provide a better summary of the truncated confusion matrix, where: Producer accuracy = TP/(TP + FP) It is furthermore easy to show that the users' and producers' accuracies for the various classes are equal to the corresponding markedness and informedness for these classes when TN tends to infinity (i.e., if all pixels of the background were considered as TN).
In addition, Count Accuracy (CA) has been widely used in remote sensing applications, such as tree counting [39] or animal counting.It consists of dividing the number of detections (TP and FN) by the number of geo-objects in the reference, with: While this index provides meaningful information at the plot level for documenting, e.g., the density of trees in a parcel or the total number of animals in a landscape, it does not provide location-specific information due to commission and omission errors that are canceled by the aggregation [40].We therefore recommend to compute CA on a set of randomly selected regions of equal areas, in order to provide an estimate of its variance in addition to its mean value.

Accuracy of a Wall-To-Wall Map
For wall-to-wall maps, the area of the map that is correctly classified is the main concern.The primary map accuracy indices suggested by the comparative study of Liu et al. [41], namely overall accuracy, producers' accuracies and users' accuracies are recommended.Pixel-based accuracy estimators and their variance can be found in the general good practices from Olofsson et al. [4].For GEOBIA results, the same indices should be used.However, with polygon sampling units, the estimators of the primary accuracy indices should take the variable size of polygons into account in order to avoid bias and to reduce the variance of the predictors.
In the context of polygon-based estimation of the thematic accuracy, correctly classifying large polygons will contribute more to the map quality than correctly classifying smaller ones.Unbiased predictors of the primary accuracy indices have, therefore, been suggested in Radoux and Bogaert [42].Under the hypothesis of binary agreement rules and representative samples, these indices use the knowledge about the area of the polygons to predict overall accuracy (Equation ( 7)), user accuracies (Equation ( 8)) and producer accuracies (Equation ( 9)), with: where N is the total number of polygons in the map, n is the number of sampled polygons and S i is the area of the i-th polygon.α ij ,β ij are binary indicators of the actual and predicted labels j of polygon i. ∑ k j=1 α ij β ij is therefore equal to one, if the i-th polygon was correctly classified, and zero, otherwise.p j is an estimate of the probability to belong to class j, and p b i is an estimate of the classification accuracy (Ω is the set of all possible classes in the map).The main difference with the classification accuracy estimated with the standard point-based estimators is the inclusion of the size of the polygons in the first term of the numerators of each equation.Furthermore, the area of the non-sampled polygons is also taken into account, so that all information at hand about the polygons is used.Finally, it is worth noting that the second terms of the numerators of Equations ( 7)-( 9) have more influence when the number of polygons is large: the above predictors simplify to the standard estimators when N becomes infinite if the count-based accuracy is not influenced by the size of the polygons.
Relationships between size and classification accuracy have been highlighted in previous studies [42,43] and could impact the predictors.In theory, these relationships can be plugged into Equations ( 7)-( 9) by replacing p j with a function of the size (Radoux and Bogaert [42]).In practice, it is not trivial to identify this function and to estimate its parameters.A pragmatic solution consists of (i) splitting the sample based on quantiles for the areas of the polygons, then (ii) estimating p j q for each size category q and, finally, (iii) multiplying p j q by the sum of the polygons' areas in that size category q.Another solution consists of estimating the area-based accuracy from the sampled polygons only, with: where n is the number of polygons in the sample.This formula (that can be adapted for both users' and producers' accuracy estimates) has a larger variance than Equation ( 7), but it is unbiased.It is therefore of appropriate use, when the size of the sample is small and p j q is difficult to estimate.So far, the best approximation of the prediction variance of the overall accuracy estimator given by Equation ( 7) is: which can be used to derive the confidence interval around the predicted overall accuracy index (to the best of our knowledge, there are no simple formulas for the prediction variance of the polygon-based users' and producers' accuracies).Equations ( 7)-( 9) will be particularly efficient when the number of polygons in the map is small, and they are at least as efficient as a point-based sampling when the number of polygons is very large [44].

Response Design
The response design encompasses all steps of the protocol that lead to a decision regarding the agreement of the reference and map classifications [4].Response design with GEOBIA is an underestimated issue.Indeed, because classification systems involved in GEOBIA are often complex, not only accuracy issues (e.g., mistaken photo-interpretation), but also precision issues (e.g., difficulty to estimate the proportion of each class of the legend) could affect the quality of the reference dataset.
The sampling unit must be defined prior to specifying the sampling design, and it is not necessarily set by the map representation [25].In the case of GEOBIA, the use of polygons sampling units is usually recommended [27,45] because the legend is defined at the scale of the polygons and not at the scale of the pixels.However, there is no universal agreement on a best sampling unit [1].In order to select a type of sampling unit, the conceptual model and the sampling effort must indeed be taken into account.
For spatial entities, polygon sampling units prevail because distinct individual entities are extracted.Individual pixels indeed need to be aggregated at different levels to compose a geo-object.For instance, a high resolution pixel identified as a window could be the windshield of a vehicle or the skylight of a building roof top.The photo-interpretation of very high resolution images, therefore, intrinsically takes context into account to identify spatial entities.The response design should, however, include rules for handling imperfect matches between the reference and the image-segment.Topological relationship, such as intersection or containment, based on the outlines or the centroids, define unambiguous binary agreement rules with a tolerance for delineation errors.Alternatively, an entity may be considered correctly labeled if the majority of its area, as determined from a reference classification, corresponds to the map label [1].
For spatial regions, various classes (e.g., urban area, open forest, mixed forest, orchard) are also defined at the polygon level and would not exist at the scale of a high resolution pixel [46].Those classes typically (i) represent complex concepts where the context plays a major role for the interpretation or (ii) generalize the landscape.The generalization of the landscape offered by GEOBIA may better represent how land cover interpreters and analysts actually perceive it [36].The selected legend is then strongly related to the scale at which a spatial region is observed and linked with minimum mapping units, which are not necessarily identical for all classes (e.g., [47]).However, when using polygons as sampling units, one should make sure that the object-based generalization is suitable to represent the variable of interest.Otherwise, pixel sampling units are more relevant.
When polygon sampling units are selected, the rules that define the agreement are usually simpler when the reference data are based on the same polygons as those on the map.Nevertheless, object-based quality assessment could need information about the segmentation quality in addition to the thematic accuracy assessment.As discussed in Sections 6.1 and 6.2, manually digitized or already existing polygons of reference may then be needed.The use of manually digitized polygons for the assessment of the thematic accuracy is however complex because the aggregation rules are then applied on different supports.
Issues related to the heterogeneity of polygon sampling units are similar to the so-called mixed pixels problem.It is therefore of paramount importance to clearly define the labeling rules in order to avoid ambiguous validation results.There is however no universal set of decision rules that fits to all mapping purposes.As a consequence, the same landscape can be described with different legends (e.g., Figure 3), which cannot be compared without a fitness to purpose analysis.For the thematic accuracy assessment, the response design links the observed reality to the type of legend of that has been selected for the map.Those legends can be categorized into majority-based, rule-based (such as the Land Cover Classification System (LCCS) [48]) or object-oriented :

•
Majority rules assign a label based on the dominant label inside each polygon.The legend includes the minimum number of classes necessary to represent each of type of geo-object that are identified with a smaller scale factor.The labeled polygons are usually interpreted as "pure" geo-objects.It is, therefore, most suited when GEOBIA aims at identifying spatial entities, but is frequently used with spatial regions, as well.In this case, the polygons are used to generalize the landscape (e.g., mapping the forest and not the trees) or to avoid classification artifacts.When there are more than two classes, it is worth noting that the majority class could have a proportion that is less than 50%.Majority rules could therefore be source of confusions.

•
Rules-based legends (e.g., FAO LCCS rules) are built upon a set of thresholds on the proportion of each land cover within the spatial unit.When correctly applied, those rules generate a comprehensive partition of all of the possible combinations of its constitutive spatial entities.They are well suited to the characterization of complex land cover classes using spatial regions.
Each LCCS class is characterized with a unique code that guarantees a good interoperability.

•
The goal of object-oriented legends is to build human-friendly representations of geo-objects based on the knowledge that people cognitively have about the geospatial domain [49].
The conceptual schema describing information are linked with formal representation of semantics (i.e., ontologies) [50] in an attempt to capture and organize common knowledge about the problem's domain.The concepts are often well understood (e.g., an urban area), but difficult to quantitatively define.Land use and ecosystem maps belong to this category.Each concept has a set of properties and relations between spatial entities, which involve geometric and topological relationships.When GEOBIA is used as a means to improve classification results at the pixel level, treating a polygon as 100% homogeneous could lead to a biased estimate of the number of pixels in a class.In order to properly assess the impact of GEOBIA on pixel classification, it is therefore suggested to use pixel-based sampling units.Indeed, the alternative method combining object-based thematic accuracy assessment and a measure of the polygons heterogeneity is likely to be less cost effective to achieve the same results.
The variance of the overall accuracy estimation with polygon sampling units is smaller or equal to the variance of a pixel-based validation under the same type of sampling scheme and with the same number of sampling units [51], but there are two disadvantages of using polygon sampling units.Firstly, the response design with polygons is often more complex than with pixels, especially when geometric quality needs to be assessed.Polygon-based validation therefore does not necessarily reduce the total sampling effort.Secondly, polygon sampling units are bounded to a given partition of the landscape.They are therefore difficult to reuse for change detection [25] if the same segmentation is not used for all dates.

Sampling Scheme for Polygons
In practice, it is usually not possible to apply a response design to the entire map.A sampling design is then used to select a subset of the polygons that aims to be representative of the whole dataset.In relationship with the analysis section, the sampling design should be based on the polygons list in order to achieve equal probability sampling for each polygon.A pragmatic solution consists in randomly shuffling the list of polygons, then selecting the n first ones.This method yields an equal probability sample without replacement.In contrast, the selection of polygon sampling units based on randomly selected pixels on the map will select polygons with a probability proportional to their area [1].This method should, therefore, be used with care, because existing estimators of the quality indices assume an equal probability sampling.For the same reason, systematic sampling cannot rely on the selection of regularly spaced polygons.Alternatively, a regular coverage of the study area with nearly equal probability sampling can be achieved by randomly selecting one polygon inside each cell of a regular grid several times larger than the average size of the polygons.
Because the classification accuracy also depends on the size of the polygons, Stehman and Wickham [1], Hernando Gallego et al. [5] suggested to stratify the sampling, based on categories for the polygon areas.This kind of stratification could replace a class-based stratification because polygon area is also correlated with class probability [42].Accounting both, for these areas and class probabilities in the stratification, could thus be unnecessary and would produce too many bins, which would result in less reliable estimations of the users' and producers' accuracies.In any case, the area covered by each strata should be used to adjust the estimates of indices accuracy, as a generalization of the equation of Card [52] for pixel-based stratification: where π k is the proportion of the area of the stratum k that is correctly classified and w k is the proportion of the area of the map covered by the stratum.

Sampling Scheme for Entity Counting
Most of the time, detected geo-objects only cover a portion of the area of interest.A sample of spatial entities detected by the image analysis is, therefore, unable to measure the omission errors.On the other hand, spatial entities may cover only a small part of the area.Building a reference dataset based on random point-based sampling is not practical, because the sampling probability of the spatial entities would be too low.An external wall-to-wall dataset with the location of the spatial entities is therefore needed.When the exhaustive validation of the study area is impractical, region-based sampling can be used to provide a subset of territory from where to extract reference data.Whiteside et al. [20] used buffers around random point samples; however, systematic sampling based on regular grids [53] (or beehives) is a sound alternative.

Geometric Precision
When the boundaries are extracted during the GEOBIA process, information about their quality could be needed.The boundary extraction depends on the scale of the analysis (scale is the great boundary maker according to Couclelis [8]) and on the resolution of the input data.We must therefore distinguish two uncertainty types: we may be uncertain (i) with respect to the precise location of a crisp boundary due to measurement errors or (ii) with respect to indefinite correspondence between concepts (e.g., ecotone) and the represented world [8].Furthermore, there is a duality between the definition of a boundary and the definition of its interior.The geometric quality of the delineation can, therefore, focus more either on the interior of the geo-object (structural quality) or on its boundaries (positional quality), depending on the application.

Structural Quality
Structural quality is related to the ability of the polygon to enclose a specific patch [5].Structural quality depends on the segmentation quality and can, in turn, affect the thematic quality.The two facets of structural quality are over-segmentation and under-segmentation.On the one hand, over-segmentation consists in subdividing an geo-object into two or more smaller polygons.It is usually due to a larger-than-needed scale of analysis.On the other hand, under-segmentation consists of including sizable parts in the interior of the polygon that had rather been placed in a different polygon.The impact of over-and under-segmentation on the quality of the product depends on the purpose of the analysis.
Structural quality is particularly important in the case of entities detection because it directly impacts the geo-objects count.Under-segmentation can induce under-detection, while over-segmentation tends to contribute to an overestimation of the number of geo-objects.Over-and under-segmentation are quantified based on the comparison between the polygons and reference polygons (often manually delineated on the image).Under-segmentation is measured by the number of mapped polygons having their centroids inside the reference polygons and the opposite can be done for over-segmentation [32].However, in practice, it may not be necessary to derive specific indices related to over-and under-segmentation in the case of spatial entities.Indeed, as mentioned in Section 3, over-segmentation errors are already contributing to the count of true positives (hit) and under-segmentation errors will increase the number of false negatives (miss).
For area-based mapping, it is commonly accepted that a poor quality of the segmentation leads directly to a low quality of the classification [54].However, over-segmentation does not have a direct impact on the final quality because correctly classified polygons can be merged after classification if necessary [55].For instance, Grenier et al. [56] used merged polygons from different levels for validation.Over-segmentation could, however, affect the thematic accuracy, as previous studies highlighted that the classification accuracy depends on the size of polygons [42,43], but its impact is not always negative.As this impact can be measured by the thematic accuracy indices, investigating over-segmentation is not necessary for spatial region mapping.
On the other hand, under-segmentation can adversely impact the quality of the map because subparts are missing.The definition of the classes then plays a major role in the accuracy assessment and could lead to major misunderstandings.GEOBIA is indeed often linked with a data model representing a continuous reality with discrete boundaries [57].As discussed in the introduction, spatial heterogeneity is a component of many spatial regions and entities.For instance, a car loses its functionality if its wheels are removed, and a savanna would not host the same animal species with a continuous tree cover.Heterogeneity is, therefore, not necessarily an undesired characteristic of geo-objects, but it can be optimized at a given scale.
The response design (Section 4) should take heterogeneity into account and specify when individual parts compose the geo-objects (e.g., windows on a roof top of a building, sparsely distributed trees in an open forest) and when they are erroneously included (e.g., island in a lake).In the case of an inclusion, a minimum mapping unit is needed to determine if the polygon is correctly delineated.This minimum mapping unit is linked to the scale of the map, but can also differ for each class of a geographic database because geo-objects occur at different scales [47].
From a quality assessment perspective, under-segmentation should thus be considered as a precision issue and not an accuracy issue.A well defined legend indeed includes decision rules for all possible associations of individual parts, so that each predicted label can be matched with a corresponding reference label.However, the variance of the proportions of elementary units inside each polygon is larger when there is a large probability to have inclusions.A GEOBIA product with large under-segmentation issues could, therefore, be accurately labeled, but might provide imprecise estimates of the area covered by elementary units.
In practice, assessing the structural consistency of the polygons would be very time consuming because it is similar to the area-based accuracy assessment of a map, but applied to each sampling unit.A wall-to-wall approach, similar to the area-based segmentation goodness indices [58], can thus be used for each polygon.In practice, this level of quality assessment is only necessary for the delineation of spatial entities, and it is sometimes useful for large spatial regions.

Positional Quality
Due to the scale of the errors, positional quality of polygons has less impact on the thematic quality assessment than pixel sampling units [1].However, positional errors do affect the map quality.Those errors could be due to (i) geometric errors in the source data or (ii) delineation errors resulting from the segmentation process.Information about the spatial precision of the map should, therefore, be provided in order to document the scale at which it can be used, and to detect potential bias due to the difference between the actual boundaries and those observed by remote sensing.The uncertainty on the position of these boundaries may come from different sources: (i) the spatial resolution of the sensor; (ii) orthorectification errors (e.g., residual parallax along large height gradients); (iii) segmentation errors; (iv) fuzziness due to gradual transitions between adjacent regions (e.g., ecotone) and (v) fuzziness due to variability over time (e.g., sea shore).
The boundaries of spatial entities are usually sharp and easy to describe with a set of widely accepted rules.However, the actual position of the boundary between two spatial regions is often fuzzy (e.g., ecotone between two biotopes).Previous studies further showed that some geo-objects were more complex to accurately digitize than others, and that there was a high degree of variability among image interpreters when hand-digitizing the same geo-objects [59].The geometric precision of GEOBIA should, therefore, focus on true land cover transitions when available [5], while median boundaries are recommended when the photo-interpreters disagree [60].
In order to document the uncertainty of the position of the boundaries, the distance between the reference boundaries and the corresponding boundaries in the geographic database has to be measured.The Root Mean Square Error (RMSE) or CE 95 (the radial error, which 95% of all errors in a circular distribution will not exceed) [61], are standard indices to report positional errors and are directly related to the cartographic scale.There is no standard method to estimate these errors, but two types of approaches have been proposed:

•
Surface-based methods consider the geo-object as an entity to be compared with a polygon of reference.Areas of the controlled polygons in and out of the reference polygon are then measured for different buffer sizes [17] so that the proportion of correctly delineated geo-objects is reported for a set of distances.This method can thus be used to empirically determine the buffer distance that embeds a given proportion of the errors [62].

•
Edge-based methods measure some statistics about the distance between the reference edge and the delineated edge.The RMSE can be derived from point samples along the reference boundaries compared with their perpendicular projection on the polygon boundaries [63].
The mean absolute error is estimated by dividing the area of mismatch by the length of the boundary [5,64].The proportion of the boundaries within a given tolerance has also been suggested [21].These methods are more difficult to implement, but can be useful to describe the errors for different types of edges.The fuzziness and the measurement errors indeed depend on the type of edges [6,63].
Alternative methods looking at the boundaries include a comparison of the distance between vertices [58] or the overlap of buffers around the perimeters of the two geo-objects [65].They are, however, more difficult to relate to scale: the former could be affected by the density of vertices and the latter is not designed to reach 100% because the same buffer is applied to each boundary.

Overview of the Recommended Practices
Details about the quality assessment of GEOBIA products with a crisp classification are provided in the previous sections.From a more pragmatic viewpoint, it is not necessary to estimate all indices in all cases.Table 1 summarizes the recommended practices with respect to the objectives of the GEOBIA described in Section 2.
One needs to first identify if GEOBIA is used to enhance per-pixel classification or if the conceptual model of the geographic database is better represented with polygons.In the first case, pixel sampling unit is recommended, because: (i) it simplifies the response design and (ii) it does not require specific boundary or structural quality control.Standard methods for the accuracy assessment of pixel-based classification (see [4]) are thus fully applicable in those cases.
When the legend is defined at the scale of the polygons, the thematic accuracy assessment should be based on the polygon sampling units.An appropriate response design should then take the conceptual model into account in order to define the binary agreement between the polygons and the "real world" (see Section 4).Within object-based conceptual models, GEOBIA products are usually either a wall-to-wall map of the landscape into (mainly) spatial regions, or a spatially discontinuous set of one (sometimes several) types of spatial entities.These two types of GEOBIA products should be validated with the set of indices that reflect their purpose.
In case of wall-to-wall mapping, (i) area-based accuracy assessment indices are recommended (Section 3) and (ii) additional information about the precision of the boundaries and the generalization of the content could be needed (Section 6).The response design should be based on the content of each polygon.Due to the variable size of those polygons, the area of each sampling unit has to be taken into account in order to provide an unbiased estimate of the accuracy indices.
Finally, studies dedicated to a set of spatial entities usually aim at either updating large scale vector maps (entity delineation) or inventorying a specific type of entity (entity counting).For the former, thematic accuracy is not a primary concern and the validation should focus on the geometric quality of the polygons (Section 6).For the latter, count-based quality assessment (Section 3.1) should be applied with a response design that addresses possible segmentation errors.However, the quality of the delineation is usually not an issue as long as the spatial entities are correctly detected.In case of poor delineation, the response design should clearly state the matching rules that define when a spatial entity has been detected.

Illustration of Tree Crown Detection
This example is a synthetic case study of a typical GEOBIA analysis where the purpose is to count a specific type of spatial entities.For instance, these could be houses, trees, tents, vehicles or animals.It is worth noting that, especially for vehicles and animals, the reference must come from the same image as the one that is used from the image analysis, therefore assuming that photo-interpretation will be closer to reality.Looking at images of the same sensor at different dates could further help to consolidate the photo-interpretation because it is unlikely that moving spatial entities remain exactly at the same place on two different dates.
In this case study, the photo-interpreters are asked to locate one point at the centroid of each spatial entity that they identify on the map.The response design considers here, as a match, a centroid that is closer than 2 m from the centroid of the corresponding reference entity.However, given that our simulated results include 11,000 points (see Figure 4), validating all of them would be cumbersome and a subset region has to be selected.
At this stage, only detected entities are available, and they are homogeneously distributed in the study area.A systematic region-based sampling without stratification is, therefore, selected.In other circumstances, the density of detected entities could be used to stratify the sampling.
A priori, the number of points needed to assess the quality of the GEOBIA results could be computed based on Equation (14), as in Olofsson et al. [4], with: E.g., with an expected accuracy of p = 0.9, one needs approximately n = 864 points to estimate the TP rate with a confidence interval d of ±2% at a 95% confidence level (for which z = 1.96).With a grid of 100 cells, there will be 110 detected entities per cell on the average.Eight of those cells are, therefore, randomly selected to obtain a sufficient sample size, and reference data is then collected inside each cell (see Figure 4).As the spatial entities are sparsely distributed, true negatives are undefined.Equations ( 4) and ( 5) are, therefore, used to estimate the user's and producer's accuracies based on the number of true positives (TP = 793), false negatives (FN = 87) and false positives (FP = 85), with: PA = 793/(87 + 793) = 90.1% When the sample is based on regions, the total number of sampling units is a priori unknown.Now that the total number of points is known for each sampled region and the accuracy indices have been estimated, the 95% confidence interval can be updated by replacing the guessed values with the observed values.Equation (17) gives the result for U A, which is approximately the same as for PA.
CI U A = 1.96 × 0.9 × (1 − 0.9)/878 = 1.4% (17) Finally, the count accuracy can be computed from Equation (6) for each cell, then the bias and the confidence interval on this quantity can be measured.In this example, omissions and commissions errors largely compensate each other, but there is a variability between cells.The CA here is equal to 1.002 ± 0.04 at 95% confidence level.The confidence interval is based on the variance of the CAs computed for each sampled cell.

Wall-to-Wall Land Cover Map
The wall-to-wall land cover map presented here is a real case study and was produced in the framework of the Lifewatch European Research Consortium.The version assessed here is the first public release (Version 2.7), which was based on 25-cm-resolution aerial images of 2015, and LIDAR data from 2012/2013.It covers 16,844 km 2 of the Walloon region (Belgium) with 1,000,000+ polygons.
A stratified random sampling is used for the validation, based on three super-classes: open areas (crops, pasture and vegetation mixtures with small tree cover), forest areas (broadleaved, needle-leaved, mixed forests, and vegetation mixtures dominated by trees) and others (sparse urban, dense urban, bare soils, water and wetlands).The classes are grouped in three strata in order to minimize confusions within them and to include all classes.The characteristics of each stratum is summarized in Table 2.A lower accuracy was expected for the open areas stratum due to the low separability between some crops and pastures.More points were therefore selected in this category according to Equation ( 14), with values of p = 0.8, 0.9 and 0.9 for the overall accuracies of the Open areas, Forest areas and Other areas, respectively.Considering the large (>10 6 ) number of polygons in the map, Equation ( 14) can be used as a conservative approximation to compute the number of required sampling units.The number of validation samples was then empirically optimized for a precise estimation of the overall accuracy, aiming at a confidence interval of ±2.5% at a 95% confidence level, with: CI OA = 1.96 × 0.46 2 × 0.8(1 − 0.8)/392 + 0.33 2 × 0.9(1 − 0.9)/219 + 0.20 2 × 0.9(1 − 0.9)/142 ≈ 2.5% (18) For each stratum, randomly selected polygons are validated by photo-interpretation based on orthophotos and ancillary data from authentic sources.The response design used threshold-based decision rules that are provided in the Appendix A. Table 3 summarizes the outputs in terms of number and area of polygons that are correctly classified within each group.In order to take into account the relationship between area and classification accuracy, the count-based accuracy was estimated with quantiles of polygons area.Three quantiles were used with the Open areas stratum and only two were used for the other strata, as a compromise between the number of bins and the precision of the estimates inside each bin.For the Open areas stratum, the count-based accuracy of the smallest polygons was significantly smaller than for the medium and largest polygons, with values respectively equal to 82.3%, 90.8% and 92.4%.There is a significant size effect too for the Others stratum (83.1% and 90.1% respectively for small and large polygons), but there is no significant size effect for the Forest areas stratum (83.5% and 85.5% respectively for small and large polygons).As an example, the overall accuracy for the Other areas stratum is computed as: by plugging the observed values in Equation (7).The area in the sample is 112 hectares of correctly classified polygons and the total area of the polygons from the Others stratum is 347, 128 hectares (see Table 2).The count-based accuracies estimated from the small and large polygons are multiplied the total area of the small and large polygons in the set of non sampled polygons.The overall accuracy of the whole map is computed as the weighted sum of the overall accuracy of each stratum (Table 3) based on their relative proportion (Table 2).The total overall accuracy value is 88.3%.As the measured overall accuracies per stratum were different from the first guess used for the sampling design, the confidence interval needed to be updated.By reusing Equation (18) with the observed values, it was updated to ±2.3%.
Both count-based (Table 4) and area-based (Table 5) confusion matrices are presented in this example to highlight the differences between them.On average, there is an absolute difference of 2% between the various users' accuracies and 4% between the various producers' accuracies.The largest difference (15%) is observed for the producers' accuracy for water (W), because the omission errors occurred for very small polygons.) and open vegetation mixture with some trees (TM2).The table also includes the Count-based Users' (U A) and Producers' Accuracies (PA).The values in this table were not weighted by their sampling probability for the sake of clarity, but the relative number of polygons in each stratum is taken into account for the computation of the producers' accuracy.The structural quality of the polygons was quantified using the proportion of the dominant class.The average proportion within the sampled polygons was 80%.As mentioned in Section 4, polygon heterogeneity is sometimes necessary to describe complex landscapes.However, it is often difficult to objectively decide when the structural quality could be improved.In this study, a larger average proportion was observed for correctly classified than for incorrectly classified polygons (87% and 66%, respectively).This tendency to better classify the homogeneous classes was also captured by the confusion matrix, where most of the errors occurred at the transition between two classes.This suggests that the thematic accuracy of the map could be improved if the under-segmentation was reduced or if the legend could be optimized for the specifically heterogeneous regions.On the other hand, the homogeneity is a very class-specific feature that is related to the landscape.Crop is the most homogeneous land cover in the study area (98%), closely followed by bare soil and water (>95%).Grassland, which might encapsulate isolated trees, has a homogeneity value of 91%.Forests are the most heterogeneous (≈80%) of the homogeneous classes due to natural and artificial thinning or gaps.The other classes are heterogeneous by definition, ranging from 48% to 66%, and are not defined by the majority class.For instance, one third of the urban areas have a majority class that is not built up, but they are nevertheless classified as urban areas because they are covered by at least 25% of buildings and roads.

Class
With respect to geometric precision, reference boundaries have been selected from correctly classified polygons in the homogeneous classes in order to avoid processing fuzzy boundaries during the quantitative assessment.The area of the omission and commission errors on both sides of the reference boundaries (Figure 5) is divided by the length of the reference boundary.The result provides an estimate of the mean distance to the boundaries.As shown in Table 6, those errors depend on the type of boundaries.Similar differences have been observed in previous studies [63].Table 6.Geometric errors along the boundaries for the main homogeneous classes.

Conclusions
The recommended good practices for GEOBIA quality assessment that were presented here are intending to extend the general guidelines from Olofsson et al. [4] in order to yield rigorous and defensible accuracy estimates in the case of GEOBIA.The objectives of the analysis should be clearly identified prior to the validation process in order to select the most appropriate approach.Critical choices include (i) the type of sampling unit (pixel or polygon), (ii) the types of accuracy indices (count-based or area-based) and (iii) the relevance of geometric quality assessment.GEOBIA accuracy assessment can be more complex than pixel-based accuracy assessment, but in turn provides more information such as area-dependent classification accuracy or class-specific boundary errors.
The proposed guidelines are based on arbitrary categories of GEOBIA objectives that are not exhaustive and not always exclusive.For instance, spatial entity detection and delineation could be combined in a single study.On the other hand, GEOBIA could also be used to derive quantitative parameters or class memberships that would require a specific validation framework.Furthermore, some key issues still need to be addressed in future studies.In order to further improve the accuracy assessment framework, standard methods for the quality assessment of boundaries are still needed and the impact of response design on the quality indices should be investigated.Furthermore, several indices proposed in this study still lack a formula for their variance, which is an issue for the comparison of algorithms and for the reporting of the estimated values.Last but not least, good practices for the estimation of areas based on GEOBIA wall-to-wall maps should be developed.

Figure 1 .
Figure 1.Illustration of the conceptual models for a Mediterranean forest in the South of France.In this landscape covered by an Ikonos image, trees are spatial entities that can be individually detected (stars on the left image).In the center, automated image segmentation creates spatial regions that correspond to the old firebreak and the undisturbed forest composed of trees, shrub and herbaceous vegetation.On the right, a continuous field representation highlights the lower (darker gray) vegetation density in the old firebreak.

Figure 2 .
Figure2.Illustration of the differences between area-and count-based accuracies for geo-objects, where correctly classified cells are in green and incorrect ones are in red (connected cells are part of a single spatial entity).The three synthetic examples have the same area-based accuracy (33% of the surface is misclassified), but different count-based accuracies.On the left, the three geo-objects have the same area, therefore area-based and count-based accuracies are identical.In the center, there is one large and one small geo-object: the size of the misclassified geo-object has an impact on the area-based accuracy, but not on the count-based accuracy.On the right, each geo-object is correctly detected (based on a majority rule), but there are segmentation errors.The area-based accuracy is then smaller than the count-based accuracy.

Figure 3 .
Figure 3. Different legends applied on the same landscape.The central image is a pixel-based representation with a majority-rule legend.Two other representations are built consistently with this classification.On the left, the classes are based on a set of thresholds derived from LCCS-like rules.On the right, a majority-based legend is used.The extent of the urban area is largely affected depending on the chosen type of legend, and this has to be taken into account for the response design.

Figure 4 .
Figure 4. Illustration of region-based sampling for the validation of spatial entities detection.Reference data were collected for eight randomly selected cells and compared with the detected entities.True positives are colored in green, false positives in red and false negatives in black.The grey points have not been validated.

Figure 5 .
Figure 5. Illustration of the segmentation errors along the boundaries.The highlighted polygons indicate mismatch with the boundaries of the reference polygons.

Table 1 .
Recommended validation practices for four objectives of geographic object-based image analysis.The sign − indicates that indices are not necessary and + indicates that they are important.The sign ± indicates that the decision depends on multiple factors.

Table 2 .
Characteristics of the strata used for the stratified sampling.

Table 3 .
Area-based overall accuracy values for each stratum.CCP are the Correctly Classified Polygons in the sample according to the response design.NSP are the Non-Sampled Polygons.OA is the Overall Accuracy for each stratum.

Table 5 .
Area-based confusion matrix.The LCCS-based legend includes six homogeneous classes as well as associations of different entities from larger scale.The homogeneous classes are Cropland (C), Broadleaved Forest (BLF), Needle-Leaved Forest (NLF), Grassland (G), Bare soils (B) and Water (W).Heterogeneous classes include Urban areas (U), Sparse Vegetation (SV), Crop-dominated Mixture (CM1), Mixture of vegetation with Crops (CM2), Tree-dominated vegetation mixture without crops (TM1) and open vegetation mixture with some trees (TM2).The table also includes the Count-based Users' (U A) and Producers' Accuracies (PA).The values in this table were not weighted by their sampling probability for the sake of clarity, but the relative number of polygons in each stratum is taken into account for the computation of the producers' accuracy.