Statistical Methods for Thematic-Accuracy Quality Control Based on an Accurate Reference Sample

Abstract: The goal of this work is to present a set of statistical tests that offer a formal procedure for deciding whether a set of thematic quality specifications of a product is fulfilled, within the philosophy of a quality control process. The tests can be applied to classification data in thematic quality control, in order to check whether they comply with a set of specifications for correctly classified elements (e.g., at least 90% classification correctness for category A) and maximum levels of poor quality for confused elements (e.g., at most 5% confusion is allowed between categories A and B). To achieve this objective, an accurate reference is needed. This premise entails changes in the distributional hypothesis over the classification data from a statistical point of view. Four statistical tests based on the binomial, chi-square, and multinomial distributions are stated, providing a range of tests for controlling the quality of the product per class, both categorically and globally. The proposal is illustrated with a complete example. Finally, a guide is provided to clarify the use of each test, as well as their pros and cons.


Introduction
The thematic component of a spatial data product is expressed as a set of classes, or category assignments (e.g., land-cover and land-use classes, geological and pedological classes, and so on). This thematic component of spatial data is of great importance in environmental modeling, decision making, climate change assessment, and so on. It is well known that spatial data are not error-free [1][2][3], and that spatial data sets represent a significant source of error in any analysis that uses them as input [4,5].
The term thematic accuracy has been broadly used when speaking about the quality of the thematic component of spatial data. The International Standard ISO 19157 [6] defines it as the accuracy of quantitative attributes and the correctness of non-quantitative attributes and the classifications of features and their relationships. Classification correctness is defined, by the same standard, as the comparison of the classes assigned to features or their attributes to a universe of discourse (e.g., ground truth or reference data) [6]. Thus, prior to a classification correctness assessment, a well-defined classification scheme is needed. A classification scheme has two critical components [7]: (i) a set of labels, and (ii) a set of rules for assigning labels. In this way, in the case of a crisp classification, a unique assignment of classes is achieved if the classes (labels) are mutually exclusive (there are no overlaps) and exhaustive (there are no omissions or commissions of classes). In this work, the quality specifications considered are of two types: (i) minimum levels of correctly classified elements (e.g., at least 90% of classification correctness for category A), and (ii) maximum levels of poor quality for confused elements (e.g., at most 5% of confusion allowed between categories A and B; for an example see Table 2 of the next section). Note that not all quality levels for confusions involve individual categories: we also allow confusions between a particular category and the combination of two or more categories.
Therefore, in what follows, we will assume that an accurate reference and a set of quality specifications are given. The goal of this work is to propose a set of statistical tests for determining whether a set of quality requirements is fulfilled. This addresses the question: Does the product agree with the reference, according to the set of established specifications? Conclusions about the thematic quality derived from the proposal are not directly comparable with those obtained from the classical methods. The reason is obvious: the statistical foundation is different. As will be seen, our approach is more restrictive and demanding, as it involves the fulfillment of a set of quality specifications by columns and/or globally. In this sense, our proposal offers a new way to deal with thematic quality control.
To achieve this objective, Section 2 presents the statistical base needed to deal with classification data when the reference data are accurate. Section 3 presents four statistical tests and their use for thematic quality control, in the sense of agreement with the set of specifications of the product. Section 4 presents an example of applications of the tests, based on an example of classification data and a set of specifications defined for this purpose. A discussion about the statistical tests and results is presented in Section 5 and, finally, some general conclusions are included in Section 6.

Statistical Foundation of Proposal
A classification data set is a set of values accounting for the degree of agreement between paired observations in k classes/categories of a controlled data set (CDS) and the same k classes/categories of a reference data set (RDS). The usual way to summarize them is by using a contingency table. Table 1 shows an example with four classes, where the reference is located by columns and the product (or data classification) by rows.

Having an accurate reference alters the statistical foundation when managing classification results. The reader is referred to [13] for a better understanding of the role of an accurate reference. For the sake of completeness, we briefly summarize this key point. Suppose the accurate RDS is located by columns. If the reference data are considered as the truth, the total number of elements known to belong to a particular category can be correctly classified or confused with other categories, but these elements will always be located in the same column and never in a different column (category). Thus, the fact that the column marginals are fixed (and not random) invalidates, from a statistical point of view, the background model of a global multinomial (for the entire confusion matrix), but allows us to deal with the classifications by columns as independent multinomial distributions. In [13], such a situation was called a Quality Control Column Set (QCCS). Figure 1 applies this idea to Table 1.

The QCCS approach provides a new perspective, which allows (i) carrying out controls centered on categories; (ii) establishing quality levels on each category; and (iii) establishing limits to the presence of confusions between categories. To illustrate this idea, let us consider the set of specifications given in Table 2. These specifications are associated with the classification data in Table 1 and are chosen only for the purpose of providing an example.
As can be seen in Table 2, we have supposed high percentages of correctly classified items (from ≥70% to ≥85%), a range of percentages in some misclassifications (from ≤10% to ≤20%), and some other, almost non-existent misclassifications (around ≤2%). Note that it has been decided to group the classes "Grazing Land" and "Vegetation" into a single class, "Grazing Land/Vegetation". Part of the misclassification levels group two or more categories. To our knowledge, this type of thematic quality control is possible only when using an accurate RDS; that is, under a QCCS.

Notation for the Application of QCCS
In this section, we recall the mathematical notation underlying a contingency table and a multinomial distribution. Let Γ 1 , . . . , Γ k be the labels of the k true categories in the RDS and G 1 , . . . , G k be the labels of the categories assigned in the CDS by the classification process. The most frequent number of categories, k, is between 3 and 7, although there are matrices that reach up to 40 categories [17]. Usually, the RDS and CDS are located by columns and by rows, respectively. Thus, the contingency table is a k × k square matrix where the assignment of an item to cell (i, j) implies that an element belonging to category j of the RDS is classified as belonging to category i of the CDS, and n ij indicates the number of items assigned to cell (i, j), i, j = 1, 2, . . . , k. The diagonal elements contain the correctly classified items in each category, and the off-diagonal elements contain the number of confusions between categories. Table 3 shows the corresponding notation. Recall that, for each category Γ j , j = 1, 2, . . . , k, we will call the class that gives its name to the category the "main class", and the "rest of the classes" could be any of the remaining initial classes or a new class obtained after merging two or more of them. This fact provides us with the possibility of carrying out a more specific thematic quality control. For example, for the class "Woodland" in Table 2, the expert is more worried about a misclassification with "Grazing Land/Vegetation" (Sp-W#2) than about a misclassification with "Bare Area" or "Urban" (Sp-W#3 or Sp-W#4).
For each category Γ j , we denote by m j the number of elements to be classified and define X 1j as the number of elements correctly classified in the main class j, and X ij , i = 2, . . . , q j ≤ k, as the number of misclassifications between the main class and the rest of the classes. If, for category j, the rest of the classes are the initial ones, then q j = k; otherwise, if two or more classes of the rest are merged, then q j = k − 1 or k − 2, and so on, depending on the number of merged classes. In this way, each vector X j = (X 1j , X 2j , . . . , X qjj ) is modeled by a multinomial distribution with parameters m j and probability vector π j = (π 1j , π 2j , . . . , π qjj ), j = 1, 2, . . . , k. The probability vector represents the probability of proper classification in the jth main class and the probabilities of misclassification between the jth main class and the rest of the classes. After the classification procedure, and following the same indices in the notation, the m j elements in Γ j are allocated according to the observed frequencies n 1j , n 2j , . . . , n qjj , j = 1, . . . , k. Note that if all the classes belonging to "the rest of the classes" are merged (e.g., Sp-U#2), a binomial distribution is obtained (q j = 2).
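The per-column model can be sketched in code. The following minimal illustration uses hypothetical counts and probabilities (not taken from Table 1) and shows that merging all non-main classes reduces the multinomial column to the binomial case, as noted above:

```python
# Sketch of the per-column multinomial model of a QCCS. The category
# counts and probability vector below are hypothetical, for illustration.
from scipy.stats import multinomial, binom

# One reference column: m_j elements, observed counts per class
# (main class first, then the misclassification classes).
m_j = 100
counts = [82, 10, 5, 3]          # n_1j, n_2j, n_3j, n_4j
pi_j = [0.80, 0.10, 0.06, 0.04]  # probability vector pi_j (q_j = 4)

# Likelihood of the observed column under M(m_j, pi_j).
lik = multinomial.pmf(counts, n=m_j, p=pi_j)

# Merging the three non-main classes into a single "rest" class reduces
# the column to the binomial case (q_j = 2): X_1j ~ B(m_j, pi_1j).
merged_counts = [counts[0], sum(counts[1:])]   # [82, 18]
merged_p = [pi_j[0], sum(pi_j[1:])]            # [0.80, 0.20]
lik_merged = multinomial.pmf(merged_counts, n=m_j, p=merged_p)
lik_binom = binom.pmf(merged_counts[0], m_j, merged_p[0])
print(lik, lik_merged, lik_binom)
```

The two-cell multinomial likelihood coincides with the binomial one, which is why merged specifications such as Sp-U#2 can be handled with a plain binomial model.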
As will be seen in later sections, our proposal of an exact multinomial test will require an order in the probabilities. Therefore, in what follows, we will assume one order in the misclassifications according to the set of specifications. Such an order is established "ad hoc" by the expert and its objective is to state priorities in the misclassification levels. As an example, the assumed order in Table 2 means that, for the category "Woodland", the specification Sp-W#1 is the most important to be fulfilled; after that, the specification Sp-W#2; after that, Sp-W#3; and, finally, the specification Sp-W#4. Any other order could be considered, however.
The idea of merging classes is motivated by the set of specifications. Such specifications represent the minimum levels of correct classification for the main classes and the set of maximum levels of misclassification between the main classes and the rest of the classes. These values provide the basis of several testing problems that allow us to determine statistically whether the assessment results achieve the thematic quality levels previously stated. Following the same line in the notation, the set of specifications established for the product (Table 2) can also be displayed in terms of a set of columns, which we call a QCHS (quality control hypothesis set). Of course, there is a one-to-one relationship between the elements of a QCCS and a QCHS and, column by column, the values of a QCHS are considered as the fixed values of the probability vector of the corresponding multinomial distribution (or even the parameter of the binomial distribution, if the needed classes are merged). So, the particular value of each probability vector π j stated in the specifications is denoted by π 0j , j = 1, 2, . . . , k. Figure 2 presents the notation used to refer to the elements of the QCCS and QCHS for any category Γ j , and for the particular case of the category "Woodland", following the specifications in Table 2 involving the merging of two classes. Overall, we have defined a set of k independent multinomial (or binomial) distributions and proposed values for their probability vectors through a set of k quality levels.

Proposed Tests for QCCS
As stated previously, the goal of this work is to present a set of statistical tests that can be applied over a QCCS to perform thematic accuracy quality controls when an adequate RDS exists. As the distributions underlying a QCCS are a set of multinomial/binomial distributions, the hypothesis tests are also based on them and, in particular, on the fixed values for the probability vectors π 0 1 , π 0 2 , . . . , π 0 k , which are given by the QCHS. From the classification results (observed QCCS) and the set of specifications (fixed values of the QCHS), we wish to contrast whether the product is acceptable in accordance with the declared quality levels or, on the contrary, whether it must be rejected. In this situation, the use of a QCCS and a QCHS allows us to treat the problem of the quality of a thematic product as a set of statistical testing problems, where each column of the QCCS is compared with the values proposed by its corresponding column in the QCHS.
Four tests are presented for this purpose. Some of them focus on the percentage of correct classification in the main classes, while others focus on the whole classification (i.e., including the misclassification levels). The statistical basis of all of them is well known. Here, we systematize their use for thematic quality control and pay attention to their usefulness, in order to improve the current methodology.
The tests are organized into two groups:
• First group: There is a single quality specification for each category, always related to the correctly classified elements (e.g., at least 90% well-assigned elements in class A). Two tests for the main classes in a QCCS are presented in Section 4.1.

• Second group: There is more than one specification per category, some corresponding to the correctly classified elements and others related to maximum limits on the misclassified elements with respect to the rest of the classes (or even to mixtures of some of the rest of the classes). Two tests for the complete QCCS are presented in Section 4.2.
Next, both groups of tests are presented using the notation given for the QCCS and the QCHS, and practical examples of applications and general rules of use will be provided in the following sections.

First Group: Tests for the Main Classes

Binomial Tests
Suppose that, for each category, only the number of well-assigned elements is of interest. Consider X 1 j , the variable "number of correctly classified elements in the category Γ j ". This variable follows a binomial distribution with parameters m j , the total elements in the category Γ j , and π 1 j , the proportion of correctly classified elements. Note that each category may have a different expected probability and different sample size.
According to the QCHS, for each j, the minimum percentage of well-assigned elements to be tested is π 0 1j , such that a binomial test can be carried out based on the observed value, n 1j , with the null hypothesis

H 0j : π 1j ≥ π 0 1j

against the alternative hypothesis H 1j : π 1j < π 0 1j (see [18] for details). The p-value associated with each H 0j , say p j , is obtained as

p j = P(X 1j ≤ n 1j ), with X 1j ~ B(m j , π 0 1j ).

From this statistical base, we proceed as follows:
Step 1: State the k null hypotheses to be tested, H 0j , j = 1, 2, . . . , k, and obtain the corresponding p-values, p 1 , . . . , p k , by means of binomial tests.
Step 2: Making use of the independence among them, decide whether (globally) all the null hypotheses are fulfilled by using a MMHT.

χ2 Overall Binomial Test

In the same context as the previous case, it is possible to propose an overall test which gathers all the information about the differences from the specified proportions in a single quantity. In this case, the single null hypothesis to be tested is

H 0 : π 1j = π 0 1j , for all j = 1, 2, . . . , k,

versus the alternative that at least one of the equalities is not true, H 1 : ∃j : π 1j ≠ π 0 1j , j ∈ {1, 2, . . . , k}. To achieve this goal, taking advantage of the normal approximation to the binomial distribution, we consider the following test statistic:

T = Σ_{j=1}^{k} Z j ², with Z j = (n 1j − m j π 0 1j ) / √(m j π 0 1j (1 − π 0 1j )).

Under the null hypothesis, T is distributed as a chi-squared variable with k degrees of freedom. T takes non-negative values and, under the null hypothesis, is expected to be close to zero; so, H 0 is rejected at the significance level α for large values of T; that is, if the p-value p = P(χ² k > T obs ) < α, where T obs represents the observed value of T.
This single test has the advantage of avoiding a MMHT. Note that the null hypothesis can be rejected if the relative frequencies are worse than those proposed, but also if they are much better. Therefore, if the null hypothesis is rejected, in a second step, we detect which category (or categories) are responsible for the rejection; that is, for which category j the value of Z j = (n 1j − m j π 0 1j ) / √(m j π 0 1j (1 − π 0 1j )) is larger in absolute value, and what its sign is. A negative sign means that the observed percentage of correctly classified elements in the main class is lower than the one established in the corresponding value of the QCHS, so the category does not verify the quality level. On the contrary, if the sign is positive, the quality level is exceeded for category j. In short, a brief summary of these tests is: (1) The null hypothesis in test 4.1.1 establishes lower limits for the percentage of correctly classified elements (we have k testing problems, and globally make a decision with a MMHT); whereas, in test 4.1.2, the null hypothesis is the equality of all the percentages (all together). (2) Test 4.1.1 is more powerful than test 4.1.2, in the sense of rejecting the null hypothesis when it is false (that is to say, it can detect small deviations from the null values of quality for the correctness of the classification in each category). (3) If the null hypothesis of test 4.1.2 is rejected, the sign of each Z j informs us about whether the observed percentage of correctly classified elements is greater/lower than that in the null hypothesis for category j. Furthermore, it could be the basis for a posterior new hypothesis test, based on the fact that Z j ² is distributed as a chi-square variable with one degree of freedom (similarly to the ANOVA multiple range tests).
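Both binomial-based procedures can be sketched as follows. The column totals, observed counts and specified minima below are hypothetical, chosen only to exercise the formulas:

```python
# Sketch of test 4.1.1 (k binomial tests + Bonferroni-type MMHT) and
# test 4.1.2 (overall chi-square statistic T). All numbers hypothetical.
from scipy.stats import binom, chi2

m = [50, 80, 60]            # m_j: column totals
n1 = [44, 60, 50]           # n_1j: correctly classified per category
pi0 = [0.85, 0.70, 0.80]    # specified minima pi0_1j
alpha = 0.05
k = len(m)

# Test 4.1.1: lower-tailed exact binomial p-value per category,
# p_j = P(X_1j <= n_1j) under B(m_j, pi0_1j); the BM rejects
# globally if any p_j < alpha / k.
p_vals = [binom.cdf(n, mj, p0) for n, mj, p0 in zip(n1, m, pi0)]
reject_global_411 = any(p < alpha / k for p in p_vals)

# Test 4.1.2: T = sum of Z_j^2 with
# Z_j = (n_1j - m_j pi0_1j) / sqrt(m_j pi0_1j (1 - pi0_1j));
# under H0, T follows a chi-square distribution with k df.
z = [(n - mj * p0) / (mj * p0 * (1 - p0)) ** 0.5
     for n, mj, p0 in zip(n1, m, pi0)]
T = sum(zj ** 2 for zj in z)
p_overall = chi2.sf(T, df=k)
reject_global_412 = p_overall < alpha
print(p_vals, T, p_overall)
```

With these illustrative counts, every observed proportion is above its specified minimum, so neither procedure rejects.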

Second Group: Tests for the Complete Classification Results
Two tests are proposed to determine statistically whether the observed classification results agree with the complete set of quality levels defined previously (for the main class and for the others).

χ 2 Global Multinomial Test
Given the observed classification results (QCCS) and the summarized quality levels (QCHS), testing whether the QCCS and the QCHS agree is equivalent to testing whether all the minimum quality levels for well-classified elements (e.g., at least 90% well-assigned elements in class A) and all the limits on the maximum percentage of mixture between the main class and the rest of the classes (e.g., not more than 4% of allowed confusion between classes A and B) are fulfilled or not. This goal is achieved by testing whether the probability vectors of the k independent multinomial distributions (or even binomial distributions) in a QCCS agree with the quality levels specified in a QCHS; say, if π j = π 0 j , ∀j or, equivalently, (π 1j , π 2j , . . . , π qjj ) = (π 0 1j , π 0 2j , . . . , π 0 qjj ), ∀j. In other words, all the probability vectors are those specified in the QCHS. Thus, the null hypothesis is now

H 0 : π j = π 0 j , j = 1, 2, . . . , k,

versus the alternative H 1 : ∃j : π j ≠ π 0 j , j ∈ {1, 2, . . . , k}. To test H 0 , we take the following test statistic:

T′ = Σ_{j=1}^{k} Σ_{i=1}^{q_j} (n ij − m j π 0 ij )² / (m j π 0 ij ),

whose distribution under the null hypothesis is a chi-squared variable with v = Σ_{j=1}^{k} (q j − 1) degrees of freedom [18], when q j ≥ 3 for all j = 1, . . . , k. For each column j, when merging classes provides a binomial distribution instead of a multinomial distribution, the corresponding term in the sum must be changed to the following one:

(n 1j − m j π 0 1j )² / (m j π 0 1j (1 − π 0 1j )).

Finally, H 0 is rejected at the significance level α for large values of T′; that is, if the p-value, p = P(χ² v > T′ obs ) < α, where T′ obs represents the observed value of T′. If the null hypothesis is rejected, in a posterior analysis, the test permits us to know for which category (or categories) the null hypothesis is rejected as, for a fixed j, the jth term of T′ is distributed as a chi-squared variable with q j − 1 degrees of freedom. An additional partial null hypothesis can then be stated (i.e., H 0newj : π j = π 0 j ) and can be solved in a similar manner, as stated previously for the second step of Section 4.1.2.
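The statistic T′ can be sketched as follows, for hypothetical columns (two multinomial and one binomial after merging). With observed counts set exactly to their expected values, T′ = 0 and the p-value is 1, as a sanity check:

```python
# Sketch of the global chi-square statistic T' over a QCCS.
# The column counts and specification vectors are hypothetical.
from scipy.stats import chi2

# Each column: (observed counts, specified probability vector pi0_j).
columns = [
    ([40, 6, 4],   [0.80, 0.12, 0.08]),  # q_j = 3
    ([70, 20, 10], [0.70, 0.20, 0.10]),  # q_j = 3
    ([45, 5],      [0.90, 0.10]),        # q_j = 2 (binomial case)
]

T_prime, dof = 0.0, 0
for counts, pi0 in columns:
    m_j = sum(counts)
    if len(counts) > 2:
        # Pearson-type term: sum_i (n_ij - m_j pi0_ij)^2 / (m_j pi0_ij)
        T_prime += sum((n - m_j * p) ** 2 / (m_j * p)
                       for n, p in zip(counts, pi0))
    else:
        # Binomial column: (n_1j - m_j pi0_1j)^2 / (m_j pi0_1j (1 - pi0_1j))
        n1, p1 = counts[0], pi0[0]
        T_prime += (n1 - m_j * p1) ** 2 / (m_j * p1 * (1 - p1))
    dof += len(counts) - 1  # each column contributes q_j - 1 df

p_value = chi2.sf(T_prime, df=dof)
print(T_prime, dof, p_value)
```

Here every observed count equals m_j π0_ij, so T′ vanishes and H 0 is not rejected; the degrees of freedom are (3 − 1) + (3 − 1) + (2 − 1) = 5.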

Multinomial Exact Tests
This test is also applicable when a complete QCHS is tested. In contrast to the previous one, in this test, we make use of probability vectors formed with an assumed order (in all the categories). This order is related to the specifications in QCHS and the preferences among them.
The criterion we will adopt is that the set of specifications for a category j is not fulfilled when: (i) the percentage of elements correctly classified in the jth main class is lower than that specified in the jth column of the QCHS; or, (ii) the percentage of elements correctly classified in the jth main class is equal to that specified in the jth column of the QCHS and the percentage of misclassification with the second class is greater than the corresponding one in the jth column of the QCHS; or, (iii) the percentage of correct elements in the jth main class and the percentage of misclassification with the second class are equal to those indicated in the jth column of the QCHS and the percentage of misclassification with the third class is greater than the corresponding one in the jth column of the QCHS; (iv) and so on.
According to the stated criterion, and given that each X j follows a multinomial distribution, an exact test for testing Equation (9) is proposed. Such a type of test is constructed ad hoc.
From this statistical base, we proceed as follows: Step 1: State the k null hypotheses to be tested, H 0 j , j = 1, 2, . . . , k, and obtain the corresponding p-values, p 1 , . . . , p k , by means of exact multinomial tests.
Step 2: Making use of independence among them, decide whether (globally) all the null hypotheses are fulfilled by using a MMHT.
The exact test for Equation (9) implies calculating the exact probability, under the null hypothesis, of obtaining an outcome at least as unfavorable as the one observed in the data. So, for any fixed category Γ j , the p-value is computed by summing the probabilities of the feasible outcomes in M(m j , π 0 j ) (say, P(X j = x j )) under the alternative hypothesis. Thus, we have to determine the feasible outcomes under the alternative hypothesis by establishing when a classification result is considered better or worse than another, following steps (i) to (iv).
Finally, the p-value is the sum of these probabilities; that is, for each j:

p j = Σ_{x_j ∈ Y*_j} P 0j (X j = x j ),

where P 0j denotes the probability mass function under H 0j (i.e., that corresponding to M(m j , π 0 j )) and Y*_j stands for the set of feasible outcomes (see [13]). Appendix I in [13] presents an example of the calculation of the p-value of the exact test. It is interesting to notice that [20,21] also applied this exact test for positional accuracy quality control by considering error tolerances.
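The exact p-value can be sketched by brute-force enumeration of the outcome space. The feasibility criterion below is our reading of steps (i) to (iv) and of [13]; in particular, the handling of exact ties with the observed outcome is an assumption, so values may differ slightly from those published:

```python
# Hedged sketch of the exact multinomial p-value under the lexicographic
# criterion of steps (i)-(iv). Tie handling (the observed outcome itself
# is counted as feasible) is our assumption about the method in [13].
from itertools import product
from scipy.stats import multinomial

def exact_p_value(observed, m_j, pi0):
    """Sum P0(X_j = x) over outcomes x at least as unfavorable as
    `observed`, where X_j ~ M(m_j, pi0)."""
    q = len(observed)

    def at_least_as_bad(x):
        # Main class first: fewer correct elements is worse; ties are
        # broken by the successive confusion counts (more is worse).
        if x[0] != observed[0]:
            return x[0] < observed[0]
        for i in range(1, q):
            if x[i] != observed[i]:
                return x[i] > observed[i]
        return True  # exact tie with the observed outcome

    p = 0.0
    # Enumerate all compositions of m_j into q cells.
    for head in product(range(m_j + 1), repeat=q - 1):
        if sum(head) > m_j:
            continue
        x = list(head) + [m_j - sum(head)]
        if at_least_as_bad(x):
            p += multinomial.pmf(x, n=m_j, p=pi0)
    return p

# Category "Grazing Land/Vegetation" from the example of Section 5:
# M(99, (0.70, 0.20, 0.10)), observed (66, 22, 11); the paper reports
# a p-value of 0.2295 for this column.
p = exact_p_value([66, 22, 11], 99, [0.70, 0.20, 0.10])
print(round(p, 4))
```

This enumeration grows quickly with m j and q j ; for realistic sample sizes a recursive pruning of the outcome space, as in the R script of Appendix I in [13], is preferable.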
A brief summary of these tests is as follows: (1) Test 4.2.1 is more demanding than test 4.1.2, as the former involves the whole multinomial, whereas the latter involves only the diagonal percentages. So, the quality requirements to be fulfilled are different. (2) Although the test statistics are based on a chi-square variable in both cases, one is obtained from a binomial distribution and the other from a multinomial. (3) The goal of each test is different. In test 4.1.2, the goal is the fulfillment of a set of quality requirements related to the percentage of correctly classified elements, while test 4.2.1 involves not only the correctly classified elements but also the misclassifications.

Example Applications
The proposed four tests are applied to the data of Table 1. The data come from [16] and were used for the thematic accuracy control of a 5-class crisp Boolean classification. The study area (around the port of Tripoli, Libya) was divided into 21 blocks and the RDS was collected by a field survey, where a simple random sample of 10 locations was taken in each block (each approximately 100 km² in size). Each sample location represents a 30 m × 30 m pixel for which the overall land cover was determined.
For our purposes, the first assumption is that the RDS used in this assessment does have sufficient quality to serve as a reference for our new approach to thematic accuracy control. Therefore, the new structure QCCS (Figure 1) has to be considered, instead of the classical method based on the confusion matrix (Table 1).
Taking into account the specifications of Table 2, the observed classification results from Figure 1 are rewritten, in Figure 3, after merging some classes. The corresponding QCHS is shown in Figure 4, which refers to the minimum proportions of correctly classified elements (for each category/column) and the maximum proportions of misclassifications (for each category/column). In Figures 3 and 4, the abbreviations follow Table 2: B, "Bare Area"; G/V, "Grazing Land/Vegetation"; U, "Urban"; and W, "Woodland".
In the application of the four tests proposed, the indices 1-4 correspond to the categories 1 = B, 2 = G/V, 3 = U, and 4 = W.
Next, we apply the tests described in the previous section, in the same order and using the same data set and specifications.

Binomial Tests
In this case, we are only interested in the thematic accuracy of correctly classified elements, which means working with the first value of each column in the QCCS and QCHS shown in Figures 3 and 4. The four null hypotheses are expressed, using (1), in the following way: H 01 : π 11 ≥ 0.85, H 02 : π 12 ≥ 0.70, H 03 : π 13 ≥ 0.80, H 04 : π 14 ≥ 0.70.
Applying the binomial test for proportions in each category means that the number of correctly classified elements in each category follows a binomial distribution, X 1j ~ B(m j , π 0 1j ), j = 1, . . . , 4. Given the observed values of the number of correctly classified elements in each category, n 11 = 18, n 12 = 66, n 13 = 27, n 14 = 27, the corresponding p-values for each null hypothesis are obtained using the function pbinom in the R language [22]. Individual decisions can then be taken. In this case, all specifications for the proportions of correctly classified elements are clearly fulfilled for each category, except for the category "Urban". Furthermore, the global decision is taken using the BM, and we reject at the level α = 0.05 if at least one p-value is less than α/4 = 0.0125. As such a condition occurs for this example, the minimum level of correctly classified elements is not achieved globally for all categories.
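For the one category whose column total is stated explicitly in the example ("Urban", with m 3 = 46, n 13 = 27 and π 0 13 = 0.80), the binomial p-value can be checked directly; scipy's `binom.cdf` plays the role of R's `pbinom`:

```python
# Binomial p-value for the "Urban" category of the example:
# P(X <= 27) under B(46, 0.80). The paper reports about 0.0007.
from scipy.stats import binom

m3, n13, pi0_13 = 46, 27, 0.80
p_urban = binom.cdf(n13, m3, pi0_13)   # equivalent to pbinom(27, 46, 0.8) in R
print(p_urban)

# Bonferroni-type MMHT at alpha = 0.05 over k = 4 categories:
# compare against alpha / 4 = 0.0125.
rejected = p_urban < 0.05 / 4
print(rejected)
```

Since this single p-value falls below α/4, the global decision under the BM is rejection, as stated in the text.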
χ2 Overall Binomial Test

In this case, the observed value for T is T obs = 16.0233, and the p-value is p = P(χ² 4 > 16.0233) = 0.0111 < 0.05. In consequence, for this example, we reject H 0 at the level α = 0.05. This means that the requirement about the levels of correctly classified elements is not fulfilled at the 5% significance level.

χ 2 Global Multinomial Test
In this case, we are interested in testing the complete fulfillment of all specifications, which means working with the complete QCCS and QCHS shown in Figures 3 and 4.
The null hypothesis is H 0 : π j = π 0 j , j = 1, 2, 3, 4, with the values π 0 j given by the QCHS in Figure 4, and the observed value of the statistic is T′ obs = 27.5194. Note that the degrees of freedom are v = Σ_{j=1}^{4} (q j − 1) = 9, with q 2 = 3, q 3 = 2 and q 1 = q 4 = 4, and the corresponding p-value is p = P(χ² 9 > 27.5194) = 0.0011. As a consequence, we reject the null hypothesis H 0 , and the set of quality levels expressed through the values in the QCHS is not fulfilled (involving both the specifications for the correctly classified elements and those for the misclassifications).

Multinomial Exact Tests
As in the previous case, we are interested in testing the complete fulfillment of all specifications, which means working with the complete QCCS and QCHS, but now applying several exact multinomial tests, one for each category (column).

• Grazing Land/Vegetation. In the same line, from Figure 4 (column G/V), we get the vector X 2 = (X 12 , X 22 , X 32 ), which follows a multinomial distribution, M(99, π 0 2 ), with π 0 2 = (0.70, 0.20, 0.10). From the observed classification (66, 22, 11) (see Figure 3, column G/V), the p-value is 0.2295 (calculated according to Appendix I in [13]).
• Urban. In this case, from Figure 4 (column U), the maximum percentage of misclassified elements is 20%, so the classification results lead us to define a binomial distribution with parameters: the number of elements to be classified (m 3 = 46) and the probability of correctness in the classification, 0.80; in short, we get B(46, 0.80). The observed classification is X 13 = 27 (see Figure 3, column U) and the p-value obtained is 0.0007 (calculated, for instance, by means of the pbinom function in R [22]).
The individual exact multinomial tests reveal that only the null hypothesis for the category "Urban" is rejected. For the rest of the categories, we do not reject the null hypotheses, meaning that, for such categories, the set of specifications in Table 2 is achieved. In addition, after applying a MMHT (for example, the BM), as not all the p-values are greater than α/4 = 0.0125, it cannot be concluded that, globally, all the specifications involving the quality of the product are fulfilled, as stated in Table 2.

Discussion
This section presents the discussion, divided into two parts. The first is focused on the results of the example that has been presented, and the second mainly focuses on comparing the proposal of tests based on the QCCS with the methods and indices based on the whole confusion matrix; in this case, the argument is qualitative, as methods based on the confusion matrix cannot be applied to the presented example.
In terms of the example, we considered a variety of cases: (i) at the category level, from the initial multinomial case to a new definition of the category "Urban" after collapsing Grazing Land, Vegetation, Bare Area and Woodland (binomial case); and (ii) at the class level, we allowed the merging of classes in the classified categories (e.g., Grazing Land and Vegetation for the categories Bare Area and Woodland). This illustrates the flexibility of the proposed approach. The four tests presented here offer the user a variety of statistical tools that can cover different quality controls. The first two tests focus on the correctly classified elements and the quality control related to this topic, while the latter two focus on the fulfillment of a complete set of specifications. Furthermore, users should pay attention to the type of null hypothesis to be tested. The tests in 4.1.2 and 4.2.1 involve a single null hypothesis, whereas the other two tests involve a set of null hypotheses and the application of a MMHT. Nevertheless, the statistical calculations to be carried out are simple. The only test that presents some complexity is the last one as, being an exact test, it requires specification of the entire feasible solution space. In any case, this can be easily calculated with a script written in the R language, as shown in Appendix I (see [13] for details).
In addition, it is important to take care with the relationship between the sample size and the power of any hypothesis test. If one increases the sample size, the hypothesis test gains a greater ability to detect small effects. However, larger sample sizes cost more money, and there is a point where an effect becomes so minuscule that it is meaningless in a practical sense. Power considerations imply that the desired values of the significance level and the minimum significant difference that we want to detect have to be established beforehand. As a first approximation, we recommend using formulae for the binomial case (see [18] for details).
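As an illustration of such a first approximation, the standard normal-approximation sample-size formula for a one-sided test on a proportion can be used (this formula is our addition, not one stated in the paper):

```python
# Sketch: per-column sample size for a one-sided binomial test, using
# the standard normal-approximation formula (our addition for
# illustration, not a formula given in the paper).
import math
from scipy.stats import norm

def binomial_sample_size(pi0, pi1, alpha=0.05, beta=0.20):
    """Elements needed per category to detect a drop from pi0 to pi1
    with significance alpha and power 1 - beta."""
    z_a, z_b = norm.ppf(1 - alpha), norm.ppf(1 - beta)
    num = z_a * (pi0 * (1 - pi0)) ** 0.5 + z_b * (pi1 * (1 - pi1)) ** 0.5
    return math.ceil((num / (pi0 - pi1)) ** 2)

# E.g., detecting a fall from 85% to 75% correctness at alpha = 0.05
# with 80% power:
n = binomial_sample_size(0.85, 0.75)
print(n)
```

Smaller differences to detect, or higher power, increase the required column total m j rapidly, which is the cost trade-off mentioned above.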
Concerning the methods and indices based on the confusion matrix, comparison between those indices and the tests proposed does not make sense, as the distributional hypothesis is different and, hence, the way to deal with the classification results differs. Furthermore, the classical indices based on a confusion matrix are accompanied by a confidence interval, whereas, in our proposal, the inference is made in terms of hypothesis tests and refers to the adequacy (or not) of all categories with respect to previous quality specifications. The first two tests (cases 4.1.1 and 4.1.2) focused on the proportion of correctly classified elements. In this case, one can see a subtle relation to the classical inference about the producer's accuracy and conditional kappa (producer's) indices. The last two tests (4.2.1 and 4.2.2) represent a novel idea in thematic quality control. They involve the fulfillment of a set of quality specifications, by columns and/or globally. In summary, the proposed approach offers a new way to deal with thematic quality control.
As a guide to characterize and facilitate their application, Table 4 presents a summary of the four proposed tests.

Table 4. Summary of the proposed tests (extract).
Chi-square test (MMHT required: No). A single null hypothesis is tested. All the specifications are taken into account. It is based on an asymptotic approximation to the chi-square distribution; for a good approximation, a sample size per column greater than 40 and expected frequencies greater than 5 in each cell are suggested.
Multinomial overall test (MMHT required: Yes). A null hypothesis is tested for each category. All the specifications are taken into account. It is an exact test and needs an implementation in a programming language. For a global decision, an MMHT has to be used.

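As a minimal sketch of the chi-square case summarized in Table 4, the standard Pearson statistic for one column of the confusion matrix can be computed against the cell proportions fixed by the specifications, together with the rules of thumb for the asymptotic approximation. The function names and the illustrative figures are ours; the degrees of freedom and the exact hypothesis structure of the test in Section 4.2.1 may differ.

```python
def chi_square_stat(observed, spec_props):
    # Pearson X^2 statistic for one column of the confusion matrix against
    # the cell proportions fixed by the quality specifications.
    n = sum(observed)
    expected = [n * p for p in spec_props]
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def approximation_ok(observed, spec_props):
    # Rules of thumb from Table 4: column total greater than 40 and every
    # expected cell frequency greater than 5.
    n = sum(observed)
    return n > 40 and all(n * p > 5 for p in spec_props)
```

The resulting statistic is then compared against the chi-square critical value for the appropriate degrees of freedom (e.g., 5.991 for 2 degrees of freedom at a 0.05 significance level, from standard tables).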
Conclusions
A new framework has been proposed for the thematic accuracy quality control of spatial data products. The presented approach differs from the traditional one based on a confusion matrix, in which the RDS and the CDS are simply two data sets. The availability of an RDS of higher accuracy than the data to be evaluated is the critical and differentiating aspect of the proposed approach. The application of this new proposal also entails the opportunity, and the need, to establish more complete specifications for products, such that they not only establish the minimum required quality level for each category, but also limit the maximum misclassification level between categories.
This perspective is centered on quality control and is supported by well-known statistical hypothesis testing methods, the objective of which is to make a decision about the fulfillment or not (acceptance or rejection) of a set of thematic quality specifications. This perspective allows class-by-class quality control, including some degree of mixing or relaxation (confusions between classes). The four presented cases comprise a very flexible set of statistical tests, from which a specific application can be selected depending on the specifications to be controlled. As an example, this flexibility allows us to set different quality specifications for each class, as well as to restrict the tests to only a subset of the classes. In general, the statistical implementation of the proposed tests is not difficult; only the multinomial overall test is somewhat more complex, as it is an exact test.
As we do not address an estimation problem, only simple random sampling (SRS) is required; besides, the sample size does not play a relevant role in assuring the significance level of the proposed tests. The required SRS allows us to define the multinomial (or binomial, if needed) distribution, which is a crucial assumption in the hypothesis testing procedures. Other sampling designs (e.g., stratified sampling, cluster sampling, and so on) are common when the goal is estimation.
From a methodological point of view, given that the RDS is in columns and the CDS is in rows, the tests also work if the SRS changes from columns to rows. However, the meaning of the tests and their conclusions then change radically. In our opinion, only the column-wise application makes sense (the quality requirement is that the product agrees with reality, not the opposite). This point of view is analogous to that of the producer's accuracy.
With respect to the developed example, it has been shown that the calculations are not complex and that all the results of the tests are congruent with each other. Finally, as has been shown, the proposed approach is directly applicable to crisp classification, but we consider that it is also extensible to fuzzy classification. This is the next challenge we intend to address.