Article

Finite Mixture Models in the Evaluation of Positional Accuracy of Geospatial Data

by José Rodríguez-Avi 1,*,† and Francisco Javier Ariza-López 2,†
1 Departamento de Estadística e Investigación Operativa, Universidad de Jaén, 23071 Jaén, Spain
2 Departamento de Ingeniería Cartográfica, Geodésica y Fotogrametría, Universidad de Jaén, 23071 Jaén, Spain
* Author to whom correspondence should be addressed.
† These authors contributed equally to this work.
Remote Sens. 2022, 14(9), 2062; https://doi.org/10.3390/rs14092062
Submission received: 31 March 2022 / Revised: 20 April 2022 / Accepted: 22 April 2022 / Published: 25 April 2022

Abstract

Digital elevation models (DEMs) are highly relevant geospatial products, and their positional accuracy has a demonstrated influence on elevation derivatives (e.g., slope, aspect, curvature) and GIS results (e.g., drainage network and watershed delineation). The accuracy assessment of DEMs is usually based on analyzing the altimetric component by means of positional accuracy assessment methods that rely on a normal distribution for error modeling; unfortunately, the observed distribution of the altimetric errors is not always normal. This paper proposes the application of a finite mixture model (FMM) to model altimetric errors, and the way to adjust the FMM is provided. Moreover, the behavior under sampling is analyzed when applying different positional accuracy assessment standards, namely the National Map Accuracy Standards (NMAS), the Engineering Map Accuracy Standard (EMAS) and the National Standard for Spatial Data Accuracy (NSSDA), under both the FMM and the traditional approach based on a single normal distribution model (1NDM). For the NMAS, the FMM performs statistically much better than the 1NDM when considering all the tolerance values and sample sizes. For the EMAS, the type I error level is around 3.5 times higher for the 1NDM than for the FMM. In the case of the NSSDA, as it has been applied in this research (simple comparison of values, not hypothesis testing), there is no great difference in behavior. The conclusions are clear: the FMM offers results that are always more consistent with the real distribution of errors, and with the supposed statistical behavior of the positional accuracy assessment standard when based on hypothesis testing.

1. Introduction

Positional accuracy has always been considered a defining and essential element of the quality of any geospatial data [1], as it affects factors such as geometry, topology, and thematic quality; it is directly related to the interoperability of spatial data [2]. Considering the widespread use of geospatial information and the interoperability requirements of different geomatics applications and spatial data infrastructures (SDIs), it is crucial to ensure information quality, as this is the only means of guaranteeing reliable solutions when making decisions [3]. A particular case of geospatial data is that of digital elevation models (DEMs). Currently, there are numerous technologies (GNSS, LiDAR, InSAR, etc.) [3,4] that allow the generation of DEM data products with very diverse characteristics (numerical precision, spacing, grid storage, etc.) [3,5]. DEMs are a key data type for many application domains because they provide the height component in GIS analysis and the geomorphological description of the land [6]. They are a reference surface for all hydrological applications (water cycle, erosion, floods, etc.) [7], the basis for the development of forestry models [8] and for agricultural parcel rating [9], and they are useful in every analysis task related to civil engineering [10]. DEMs are part of the information infrastructure to achieve the Sustainable Development Goals and are considered Global Fundamental Geospatial Themes by the United Nations [11]; they are also included in the list of geospatial themes of the European Spatial Infrastructure [12]. The data model most used in the case of DEMs is the grid [13,14]. Usually, in the case of gridded DEMs, the evaluation of positional accuracy is limited to the errors in the altimetric component (elevation/height) (Case 1D). This 1D perspective is the one of interest in this document, since, without loss of generality, it allows a simpler presentation of the proposed method.
The positional accuracy in DEMs has a direct influence on elevation derivatives such as slope, aspect and curvature, and generates erroneous drainage network or watershed delineation [15,16]. Vertical positional accuracy requirements depend on the scale and specific use case; in this line, [15,17] present indicative accuracy values for some usual DEM applications.
Positional accuracy assessment methods (PAAMs) are standardized processes to either estimate or control the positional quality [18] of geospatial data. The PAAMs understand the quality of the data product as the presence of errors of limited size (e.g., less than a tolerance value for the bias or for the dispersion). Accuracy estimation consists of determining a reliable value of the property of interest (e.g., mean bias, standard deviation, proportion, etc.) in the data product. These methods provide a value and its corresponding confidence interval as a result (e.g., a mean value and its deviation such as 5.27 m ± 0.15 m). Quality control, on the other hand, involves deciding whether or not the property of interest in that data product reaches a certain quality level. Control methods are intended to provide a statistical basis for making an acceptance/rejection decision as a consequence of compliance/noncompliance with a specification (e.g., given the specification that no more than 5% of the elements present 1D-positional errors greater than 1 m, a decision is made to accept/reject according to the evidence found in the sample). Specific recommendations for the positional assessment of DEMs can be found in [18].
Acquisition technologies used in positional accuracy assessment, such as Global Navigation Satellite Systems (GNSS) and LiDAR systems, enable the collection of coordinates in the field with high accuracy, which makes more accurate positional accuracy assessments possible. Moreover, PAAMs have evolved over time, from the National Map Accuracy Standard (NMAS) [19] to the more recent standard of the American Society for Photogrammetry and Remote Sensing, called the Positional Accuracy Standards for Digital Geospatial Data [20], in which the statistics are based on the National Standard for Spatial Data Accuracy (NSSDA) [21]. It should be noted that these PAAMs apply to both planimetric control (2D-error data) and altimetric control (1D-error data). It is interesting to analyze the NMAS, the EMAS and the NSSDA, as they present different and complementary perspectives. The NMAS can be considered a distribution-free method, able to work with data regardless of their distribution [21]. This standard sets out a method of positional accuracy control that establishes an acceptance/rejection rule in a very simple manner, and is based on the binomial distribution applied to error counts. The standard is outdated, as it refers to tolerances defined on paper, that is, to the representation scale, but its conceptual basis can be applied to any tolerance value. The Engineering Map Accuracy Standard (EMAS) [22] assumes that positional errors are normally distributed and proposes a set of statistical hypothesis tests that must be passed for the product to be accepted. Specifically, it establishes two statistical tests per component, one focused on the detection of biases (Student's t-test) and the other on the behavior of dispersion (chi-square test). Finally, the NSSDA assumes the normality of the error data and is not a positional accuracy control method, as it does not establish acceptance or rejection; the result is a value and, therefore, it is an estimation method.
The normal distribution function remains the theoretical base model for some widely used PAAMs (e.g., for the EMAS and the NSSDA) because it is a suitable distribution for representing real-valued random variables generated purely at random. In fact, what is desirable when working with measurement errors is that they are normally distributed, as this implies that there are no other unknown (and therefore uncontrollable) causes affecting the measurement result. In practice, however, it is hard to find error data sets that can, strictly, be adequately modeled with one normal distribution. This circumstance has been highlighted especially for the case of DEMs [23,24]. It can be due to various causes that appear alone or together (e.g., many extreme values, overlap of several processes, elimination of data, distribution of values close to zero or to a natural limit, and so on). For these reasons, alternatives based on robust statistics [25], on nonparametric models such as the observed distribution [26], on error counting [27] or on percentiles [28], among others, have been proposed. We have chosen the case of DEMs because the non-normality of their errors has already been indicated in previous studies and because dealing with 1D errors is a simpler situation than the case of 2D errors, which makes the method easier to explain.
In this paper, we explore the case when, even assuming underlying normality, errors come from different normal distributions, that is to say, normal distributions with different parameters. In this case, an approach based on the use of Gaussian finite mixture models (FMM) is adequate for obtaining a whole parametric model that reproduces the empirical distribution of observed data [29,30,31,32]. This approach to the problem is chosen because the FMMs are nothing more than the extension of the traditional model based on a one-single normal distribution. This offers the user a familiar framework with the advantages of a parametric model for statistical inference questions. In addition, FMMs offer enough robustness and adaptability to particular distributions that can demonstrate the very varied possible use cases.
In this work, a double objective is pursued. The first is to study the distribution of the estimators in sampling under the FMM, which allows specific hypothesis tests to be proposed for the fitted model. The second is to apply this study to various positional accuracy standards (specifically NMAS, EMAS and NSSDA): for each of them the theoretical framework is defined, the procedure is developed, and it is verified how the use of the FMM improves the results obtained under the assumption of a single normal distribution. Therefore, our ultimate goal is only to propose a parametric model that can replace the widely accepted and applied univariate normal model wherever required, and not to develop a new model (theoretical or empirical) for the uncertainty or new specific indices for the evaluation of positional accuracy.
After this section, the conceptual bases of the finite mixture model are presented (Section 2). In Section 3, an overview of the methods is presented, which includes the adjustment process of the FMM and the simulation process used to analyze its behavior when applied to the selected PAAMs. Section 4 presents the data, which are altimetric discrepancies between two digital terrain models. Section 5 shows the results obtained and the application to the different standards. It is long because it presents the results of the FMM adjustment process and also of the simulation process for the three PAAMs under analysis. Section 6 and Section 7 are devoted to the discussion and conclusions.

2. Finite Mixture Models

This article proposes the application of the finite Gaussian mixture model methodology to fit a set of measurement errors. A detailed analysis can be found in [29,30,31,32] and may be summarized as follows:
  • Let $X = (X_1, \ldots, X_n)$ be the vector of observed errors, a random sample that comes from a mixture of $g > 1$ distributions $\Phi_i = N(\mu_i, \sigma_i)$, $i = 1, \ldots, g$, each of which appears with a proportion $\pi_i$ in the mixture, with $\pi_1 + \cdots + \pi_g = 1$. Then, the value of the density function of each $X_i$ is given by:
    $$f_\theta(x_i) = \sum_{j=1}^{g} \pi_j \phi_j(x_i); \quad x_i \in \mathbb{R} \tag{1}$$
  • This implies estimating the vector of parameters
    $$\Theta = \big( (\pi_1, \mu_1, \sigma_1), \ldots, (\pi_g, \mu_g, \sigma_g) \big) \tag{2}$$
    of dimension $3g$.
  • The estimation of $\Theta$ (2) is made with the EM algorithm [30,33,34,35], which is obtained iteratively through the operator
    $$Q\big(\theta \mid \theta^{(t)}\big) = E\big[ \log h_\theta(C) \mid x, \theta^{(t)} \big] \tag{3}$$
    where $\theta \in \Theta$, $\theta^{(t)}$ is the value at iteration $t$ and the expectation refers to the distribution $k_\theta(c \mid x)$ of $c$ given $x$ for the value $\theta^{(t)}$ of the parameter.
  • In this way, $g$ groups are calculated. The posterior probability of belonging to group $i$, $i = 1, \ldots, g$, is given by
    $$\hat{\pi}_{ij} = \frac{\hat{\pi}_i f_i\big(x_j \mid (\hat{\mu}_i, \hat{\sigma}_i)\big)}{\sum_{k=1}^{g} \hat{\pi}_k f_k\big(x_j \mid (\hat{\mu}_k, \hat{\sigma}_k)\big)}; \quad x_j \in \mathbb{R},\ i = 1, \ldots, g;\ j = 1, \ldots, n \tag{4}$$
    and each sample point $x_j$ is assigned to the group where $\hat{\pi}_{ij}$ is maximum.
  • The final density function is:
    $$f(x_j) = \sum_{i=1}^{g} \hat{\pi}_i f_i\big(x_j \mid (\hat{\mu}_i, \hat{\sigma}_i)\big) \tag{5}$$
    where the estimates $\hat{\pi}_i$, $\hat{\mu}_i$ and $\hat{\sigma}_i$ are those used in (4).
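The E-step and M-step summarized in the list above can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the `mixtools` implementation used in the paper: the function names, the starting values, the stopping rule and the synthetic two-component data are all assumptions made for the example, and no safeguards against degenerate components are included.

```python
import math
import random


def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma) evaluated at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def posterior(x, weights, means, sigmas):
    """Posterior probability of each component for one point, and the mixture density at x."""
    joint = [w * normal_pdf(x, m, s) for w, m, s in zip(weights, means, sigmas)]
    total = sum(joint) or 1e-300  # guard against underflow far in the tails
    return [p / total for p in joint], total


def em_gaussian_mixture(data, weights, means, sigmas, max_iter=500, tol=1e-8):
    """Fit a 1D Gaussian mixture by EM from the given starting values.

    Returns (weights, means, sigmas, log_likelihood, iterations); the
    log-likelihood is the one evaluated in the last E-step.
    """
    weights, means, sigmas = list(weights), list(means), list(sigmas)
    n, g = len(data), len(weights)
    previous = -math.inf
    for iteration in range(1, max_iter + 1):
        # E-step: posterior membership probabilities for every point.
        resp, log_lik = [], 0.0
        for x in data:
            probs, density = posterior(x, weights, means, sigmas)
            resp.append(probs)
            log_lik += math.log(density)
        # M-step: weighted updates of proportions, means and deviations.
        for i in range(g):
            n_i = sum(r[i] for r in resp)
            weights[i] = n_i / n
            means[i] = sum(r[i] * x for r, x in zip(resp, data)) / n_i
            var_i = sum(r[i] * (x - means[i]) ** 2 for r, x in zip(resp, data)) / n_i
            sigmas[i] = math.sqrt(max(var_i, 1e-12))
        if abs(log_lik - previous) < tol:
            break
        previous = log_lik
    return weights, means, sigmas, log_lik, iteration


# Synthetic data (hypothetical): 75% from N(0, 0.2) and 25% from N(2, 0.5).
random.seed(0)
sample = ([random.gauss(0.0, 0.2) for _ in range(1500)]
          + [random.gauss(2.0, 0.5) for _ in range(500)])
w, m, s, log_lik, n_iter = em_gaussian_mixture(
    sample, [0.5, 0.5], [-1.0, 1.0], [1.0, 1.0])
# Each point is assigned to the group with maximum posterior probability.
groups = [max(range(2), key=lambda i: posterior(x, w, m, s)[0][i]) for x in sample]
```

In practice EM is sensitive to its starting values, so several random starts are usually compared by their final log-likelihood.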
In order to determine the best value of $g$ (the final number of mixing distributions), the use of information criteria to choose the best fitted model is proposed. In this case, they are the Akaike Information Criterion, $AIC$, and the Bayesian Information Criterion, $BIC$ (see for instance [36,37]):
$$AIC_g = -2\hat{L} + 2p \tag{6}$$
$$BIC_g = -2\hat{L} + p \ln(n) \tag{7}$$
where $\hat{L}$ is the log-likelihood value in the estimation with $g$ groups and $p = 3g$ is the number of estimated parameters (2). In both cases, the best value of $g$ is the one for which the value of $AIC$ or $BIC$ is the minimum. The difference between the two measures is the presence in the $BIC$ of the sample size $n$. This criterion penalizes models with a greater number of estimated parameters by replacing the term "$2p$" by "$p \ln(n)$", thus selecting models of lower order than those selected by the $AIC$, which corrects the tendency of the latter to overestimate the number of components. To implement the calculations, the package mixtools of R [38,39] has been employed.
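As a small worked illustration of how the two criteria can disagree, the sketch below evaluates both for a set of candidate models. The log-likelihood values are made up for the example, and the function names are hypothetical; the parameter count $p = 3g$ follows the text.

```python
import math


def information_criteria(log_likelihood, g, n):
    """Return (AIC, BIC) for a g-component mixture fitted to n observations."""
    p = 3 * g  # parameter count as used in the text
    aic = -2.0 * log_likelihood + 2.0 * p
    bic = -2.0 * log_likelihood + p * math.log(n)
    return aic, bic


def best_g(log_likelihoods, n, criterion="bic"):
    """Pick the g that minimizes the chosen criterion.

    log_likelihoods maps each candidate g to its maximized log-likelihood.
    """
    index = 0 if criterion == "aic" else 1
    return min(log_likelihoods,
               key=lambda g: information_criteria(log_likelihoods[g], g, n)[index])


# Hypothetical log-likelihoods for g = 2, 3, 4 fitted to n = 1000 points.
candidates = {2: -1000.0, 3: -990.0, 4: -989.0}
g_aic = best_g(candidates, 1000, "aic")
g_bic = best_g(candidates, 1000, "bic")
```

With these invented values the AIC prefers three components while the BIC, with its heavier penalty, prefers two.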
Once selected, the theoretical model provides a complete description of the population from which the data come, and all population probabilities and parameters can be calculated. In this case:
  • Mean:
    $$\hat{\mu} = \sum_{i=1}^{g} \hat{\pi}_i \hat{\mu}_i \tag{8}$$
  • Variance:
    $$\hat{\sigma}^2 = \sum_{i=1}^{g} \hat{\pi}_i \hat{\sigma}_i^2 + \sum_{i=1}^{g} \hat{\pi}_i (\hat{\mu}_i - \hat{\mu})^2 \tag{9}$$
    and, in consequence, $\hat{\sigma} = \sqrt{\hat{\sigma}^2}$.
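The population mean and variance of a fitted mixture can be computed directly from its parameters. The sketch below is a minimal illustration with a hypothetical function name; it weights the between-component term by the mixing proportions, as the law of total variance requires.

```python
import math


def mixture_moments(weights, means, sigmas):
    """Mean, variance and standard deviation of a Gaussian mixture."""
    mu = sum(w * m for w, m in zip(weights, means))
    within = sum(w * s ** 2 for w, s in zip(weights, sigmas))
    between = sum(w * (m - mu) ** 2 for w, m in zip(weights, means))
    variance = within + between
    return mu, variance, math.sqrt(variance)


# Two equally weighted unit-variance components centred at -1 and +1.
mu, variance, sigma = mixture_moments([0.5, 0.5], [-1.0, 1.0], [1.0, 1.0])
```

For this symmetric example the mean is 0 and the variance is 2: one unit from the within-component spread and one from the separation of the component means.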

3. Methods

Two well-differentiated parts can be considered:
1. Estimation of a model based on mixtures (Section 3.1). This step provides the parameters of the mixing distribution functions (proportions, means and deviations). In this way, a parametric model based on a mixture of normal distributions becomes available.
2. Simulation of the behavior of PAAMs in sampling processes (Section 3.2). Simulated samples show how the estimates of the variables used by PAAMs (e.g., mean and standard deviation) behave when a parametric model based on a finite mixture of normal distributions is applied, in comparison with the traditional approach based on one normal distribution model.
The next two subsections describe these two parts in more detail and set out the proposed methodology for their use.

3.1. Estimating the Finite Mixture Model

The steps for obtaining this model are:
  • To take a sufficiently representative sample.
  • To adjust several mixing models with different finite numbers of mixed normal distributions (e.g., 2, 3, 4 and so on).
  • To determine the “best fitted” mixing model.
In relation to the first step, the utility of the resulting mixing model depends on the representativeness of the sample used for its adjustment. This work uses the whole set of errors (discrepancies) described in Section 4. In this way, the representativeness of the results is assured for this area.
The second step consists of selecting $g$, the number of normal mixing distributions that provides the best fit according to the $AIC$ or $BIC$ values. Once $g$ is selected, the third step consists of studying the model density function (5) with the selected parameter values (2), and comparing it with the observed data.

3.2. Simulation of the Behavior of PAAMs in Sampling Processes

PAAMs (e.g., NMAS, EMAS, NSSDA, etc.) applied to geospatial data products are based on samples from which one or several parameters (e.g., a proportion, the mean, the standard deviation) are derived. Some of these parameters are used for defining the classical single normal distribution model (e.g., $N(\mu, \sigma)$), which is used by several PAAMs (e.g., EMAS, NSSDA). Therefore, it is interesting to know the sampling distribution of these parameters, and to compare their behavior under two approaches: (i) the model based on a single normal distribution (1NDM), and (ii) the finite mixture model (FMM), fitted following the process indicated in Section 3.1.
Once the parameters of the FMM have been obtained, we are interested in determining its behavior in sampling processes. To this end, a Monte Carlo simulation is carried out. The process consists of generating random samples of different sizes and determining the quantiles of the estimators for each size under the two approaches. The considered sample sizes are $n \in \{20, 30, 40, 50, 80, 100, 200, 500\}$, and 5000 iterations are performed in the simulation. For each iteration, and for each sample size, the sample mean and variance are calculated, resulting in vectors of 5000 means and 5000 standard deviations, which may be considered a sample from the sampling distribution of the estimators of the model. Through these simulations, the distribution of the estimators is estimated and the quantiles to be used are determined (for example, 5% or 1%). These quantiles are later used to obtain critical values for tests. Figure 1 shows a general view of this simulation process.
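The simulation loop described above can be sketched as follows. The mixture parameters are invented for illustration (they are not the fitted values of the paper), the function names are hypothetical, and a smaller number of iterations is used so that the example runs quickly.

```python
import random


def sample_mixture(rng, n, weights, means, sigmas):
    """Draw n values from a Gaussian mixture: pick a component, then draw from it."""
    components = rng.choices(range(len(weights)), weights=weights, k=n)
    return [rng.gauss(means[i], sigmas[i]) for i in components]


def simulate_estimators(weights, means, sigmas, n, iterations=5000, seed=1):
    """Return the simulated sample means and sample variances for samples of size n."""
    rng = random.Random(seed)
    sample_means, sample_vars = [], []
    for _ in range(iterations):
        x = sample_mixture(rng, n, weights, means, sigmas)
        mean = sum(x) / n
        sample_means.append(mean)
        sample_vars.append(sum((v - mean) ** 2 for v in x) / (n - 1))
    return sample_means, sample_vars


def empirical_quantile(values, q):
    """Simple order-statistic quantile of a list of simulated values."""
    ordered = sorted(values)
    return ordered[round(q * (len(ordered) - 1))]


# Hypothetical three-component mixture (population mean 0.005).
W, M, S = [0.6, 0.3, 0.1], [0.0, 0.05, -0.1], [0.1, 0.3, 1.0]
sim_means, sim_vars = simulate_estimators(W, M, S, n=50, iterations=2000)
q05 = empirical_quantile(sim_means, 0.05)
q95 = empirical_quantile(sim_means, 0.95)
```

Repeating the call for each sample size and each quantile of interest gives tables of simulated critical values for the mean and the variance.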

4. Discrepancy Data for the Application Case

In order to simplify the example case, 1D positional error data are used. In any case, the process shown here is valid for all PAAMs that consider the components of the horizontal positional error ($e_x$ and $e_y$) as one-dimensional normal variables. In this study case, the errors are vertical, and the 1NDM and FMM models are applied to discrepancy data (errors) obtained in a study area around Allo (Navarra, Spain). It is a mid-mountain area of 504 km², where the elevation varies between 316 and 1046 m; the average elevation is 468 m and the standard deviation of elevations is 92.8 m. A map of the studied area appears in Figure 2.
Discrepancy is derived as the difference between two DEMs:
$$d_i = h_{DEM,i} - h_{REF,i} \tag{10}$$
where
  • $h_{DEM,i}$: elevation in position $i$ of a DEM product;
  • $h_{REF,i}$: elevation in position $i$ of a reference;
  • $d_i$: discrepancy in elevation in position $i$.
In this study, the DEM data sets are:
  • REF (Reference): DEM02. In this case, it is a gridded DEM (2 × 2 m resolution). Its primary data source is an aerial LiDAR survey obtained in 2017 (second coverage of the PNOA-LiDAR project https://pnoa.ign.es/estado-del-proyecto-lidar/segunda-cobertura, accessed on 28 March 2022). The reported positional accuracies for the DEM are 50 cm for $RMSE_{XY}$ and 25 cm for $RMSE_{Z_{02}}$.
  • DEM (Product): DEM05 is a gridded DEM (5 × 5 m resolution) that comes from an aerial LiDAR survey obtained in 2012 (first coverage of the PNOA-LiDAR project https://pnoa.ign.es/estado-del-proyecto-lidar/primera-cobertura, accessed on 28 March 2022). The reported positional accuracies for the DEM are 50 cm for $RMSE_{XY}$ and 50 cm for $RMSE_{Z_{05}}$.
Both data sets can be considered independent in their generation. However, the one used as a reference (DEM02) does not meet the criteria of being a true reference because its accuracy is not at least three times better than that of the product to be evaluated (DEM05). However, this circumstance does not invalidate the proposed procedure and the results obtained from its application.
Both DEM data sets are freely available on the webpage, http://www.ign.es, (accessed on 30 March 2022) of the National Geographic Institute of Spain (IGN), and have the same spatial reference system ETRS89 UTM Zone 30N.
To ensure the overlap of the two grids without degrading the quality of the reference (DEM02), the DEM05 data set was interpolated to a 2 × 2 m mesh step by means of bilinear interpolation. Following the variance prediction model for bilinear interpolation developed by [4], and considering that the variances of the four positions that intervene in the interpolation are equal and that the altimetric correlation is high, the average variance of the predictor of an altimetric value at any position is equivalent to the variance of the positions involved in the interpolation. In our case, according to the information provided by the metadata, it can be considered to be of the order of 50 cm.
The points analyzed have been obtained through systematic sampling, for which a grid of 578 rows and 853 columns was generated, providing a sample size of n = 493,034. The discrepancies are in the interval (−54.88, 77.42) m; the mean value of the discrepancies is 0.00062 m and the standard deviation is 0.41835 m. A general spatial view of the discrepancies appears in Figure 3. Usually, the discrepancies between a product and a reference are expected to be close to zero, but in this case the observed interval reveals the presence of extreme values (outliers), both on the left and on the right. Moreover, the Fisher asymmetry coefficient is 11.46521 and the Fisher coefficient of kurtosis is 1009.753; both are very high with respect to the normal distribution.
Figure 4 shows the histogram of the complete data set. Because the distribution is concentrated around 0 and the number of extreme values is relatively small, the scale of the x axis makes the values farthest from 0 invisible. In order to see the shape of the histogram in more detail, Figure 5 shows the histogram constrained to the interval (−1, 1), which contains 97.69% of the observed discrepancies.
Finally, the overall non-normality of the discrepancy data may also be observed in Figure 6, where the QQ-plot is shown together with the expected normal line. These graphics suggest a great deviation from the expected normality. This situation opens the possibility that the underlying discrepancy data model is a finite mixture of normal distributions.

5. Results

5.1. The Finite Mixture Model

As indicated in Section 3, the choice of the proposed FMM is based on the values obtained for the AIC and BIC criteria. Table 1 shows the values of both criteria when the number of mixture components, $g$, goes from 2 to 10. Because the estimation procedure is iterative, and to show the complexity of the process, the last column includes the number of iterations needed to achieve convergence.
In this case, because the sample size is very large, the BIC criterion is adopted [37]. According to Table 1, a mixture of seven normal distributions is proposed. Table 2 shows the vector of estimated parameters, $\hat{\Theta}$, where $(\hat{\mu}_i, \hat{\sigma}_i)$ are the parameters of the $i$-th normal distribution component and $\hat{\pi}_i$ is the probability of this component in the mixture.
It can be observed that the first component includes all extreme values on the left, and that the seventh component covers the extreme values on the right. Both components have a very low probability. The most important is component five (half of the population, see $\hat{\pi}_5$).
The estimated population density can be calculated according to (5) and compared with the empirical distribution of the observed data (EDOD) (see Figure 3). A graphical comparison appears in Figure 7, where the EDOD histogram is represented together with the estimated density of the FMM (the curve in orange). For visibility, the range is trimmed to the interval [−1, 1]. The maximum distance detected between both curves is 0.00041, which is a really small value and has a p-value greater than 0.1 in the Kolmogorov-Smirnov goodness-of-fit test (the critical value in this case is $1.228 / \sqrt{493{,}034} = 0.00175$).
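The goodness-of-fit check can be illustrated with a short sketch that computes the Kolmogorov-Smirnov distance between a sample and a candidate distribution function, together with the critical value quoted in the text. The mixture used here is hypothetical, the function names are invented for the example, and the coefficient 1.228 is simply taken from the text.

```python
import math
import random


def normal_cdf(x, mu, sigma):
    """Distribution function of N(mu, sigma)."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def mixture_cdf(x, weights, means, sigmas):
    """Distribution function of a Gaussian mixture."""
    return sum(w * normal_cdf(x, m, s) for w, m, s in zip(weights, means, sigmas))


def ks_distance(data, cdf):
    """Maximum distance between the empirical distribution of data and cdf."""
    ordered = sorted(data)
    n = len(ordered)
    distance = 0.0
    for k, x in enumerate(ordered):
        f = cdf(x)
        distance = max(distance, (k + 1) / n - f, f - k / n)
    return distance


def ks_critical_value(n, coefficient=1.228):
    """Large-sample critical value coefficient / sqrt(n)."""
    return coefficient / math.sqrt(n)


# Hypothetical heavy-tailed mixture: a narrow core plus a wide component.
W, M, S = [0.8, 0.2], [0.0, 0.0], [0.1, 1.0]
rng = random.Random(3)
data = [rng.gauss(M[i], S[i]) for i in rng.choices([0, 1], weights=W, k=5000)]

mean = sum(data) / len(data)
sd = math.sqrt(sum((v - mean) ** 2 for v in data) / (len(data) - 1))
d_fmm = ks_distance(data, lambda x: mixture_cdf(x, W, M, S))
d_1ndm = ks_distance(data, lambda x: normal_cdf(x, mean, sd))
```

For data of this kind the distance to the single normal model is far larger than the distance to the mixture, which is the qualitative pattern reported for the DEM discrepancies.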
The FMM provides a complete description of the population of discrepancies and allows the calculation of all population parameters and probabilities using Equations (8) and (9). In this case, $\mu_{FMM} = 0.00062$ m, variance $\sigma^2_{FMM} = 0.17502$ m², and standard deviation $\sigma_{FMM} = 0.41835$ m. Comparing these values derived from the FMM with those corresponding to the EDOD (see Section 4), it can be observed that they are the same.

5.2. Comparison of Approaches

Because most PAAMs assume a model based on a single normal distribution component (a 1NDM), it is worth comparing the results derived from the proposed FMM with those of the 1NDM and the EDOD. Results for these models are shown in Table 3, where it can be observed that the results provided by the FMM are similar to those of the EDOD, and that the 1NDM is very far from both.
With the same idea, and only as an example, Table 4 shows the probabilities calculated for intervals defined by several values, and compares the results obtained using the three models. As in Table 3, it is observed that the 1NDM behaves poorly, whereas the estimated FMM fits adequately. In particular, the FMM adequately captures both the high concentration of values around the mean and the tails of the observed data.

5.3. Analyzing the Sampling Distributions

The advantage of an FMM is that it allows working with a parametric model that describes the entire population of discrepancies. In order to use the model for building a hypothesis test, and so accept or reject some assumptions related to the population, it is necessary to know the sampling behavior of the estimators in a sample of size $n$, which is a collection of $n$ independent random variables, all of them distributed according to the distribution of the population of discrepancies. If a 1NDM is assumed, the distributions of the sample mean and variance are well known. In this case, however, we need to know the sampling distribution under the fitted FMM. To this end, a simulation procedure was carried out in which 5000 samples were obtained for each sample size. Table 5 shows the values for the mean and standard deviation of each set of 5000 samples. It can be noted that the sample mean is always a random variable with an expected value equal to $\mu$ and a standard deviation equal to $\sigma / \sqrt{n}$. The third column of Table 5 shows the values of $\hat{\sigma} = s_n \sqrt{n}$, which is very close to the standard deviation of the theoretical model. Additionally, this table shows that the square root of the mean of the variances is a less biased estimator of the population standard deviation than the mean of the standard deviations.
In a statistical hypothesis test, the test statistic is compared with the corresponding quantile of its own sampling distribution under the null hypothesis at the desired significance level (e.g., $\alpha = 0.05$). For instance, in the normal case when the standard deviation is unknown (as occurs in the EMAS test), the sampling distribution of the test statistic $T = \sqrt{n} (\bar{x} - \mu_0) / s_{n-1}$ is a Student's t distribution with $n - 1$ degrees of freedom, whose quantiles are easily obtained. Something similar occurs in the case of the variance test, where the sampling distribution is a $\chi^2_{n-1}$. In the case of an FMM, these quantiles are not known in advance, but they can be obtained through a simulation process. In our case, the quantiles were determined from the 5000 samples generated to derive Table 5. These quantiles appear in Table 6 for the mean and in Table 7 for the variance. These tables may be used to find the critical values for the statistical hypothesis tests on the mean and the variance.

5.4. Application to Positional Accuracy Assessment Methods

The PAAMs analyzed in this paper are based on statistical hypothesis tests on proportions (NMAS) and on the mean and deviation (EMAS), but also on the result of estimation processes (NSSDA). These situations are very different, but the FMM can be applied to all of them, and it is valuable to compare the result of this application with the results of applying the 1NDM, which represents the traditional approach. In this subsection, the philosophy of each of these three standards is applied using the FMM, and the results are compared with those obtained under the 1NDM approach.

5.4.1. National Map Accuracy Standard

There are several PAAMs based on the proportion test; one of the most popular is the NMAS (Appendix A), but others exist (e.g., [18,40]). Basically, these methods work by setting a metric tolerance and a maximum proportion of cases that may exceed it. The control sampling is carried out, the number of observations (discrepancies) that exceed the tolerance is counted, and it is verified that the proportion of cases that exceed the metric tolerance is less than the established proportion. If the observed proportion is greater than the established proportion, the product is rejected. The application in this case is immediate. Let $x_H$ be the desired metric tolerance value, and let $\pi_H = P[|X| > x_H]$ be calculated in the FMM using Equation (5). Several examples were presented in Table 4, and these probabilities have been used here. The null hypothesis is
$$H_0: p \le \pi_H$$
and the alternative hypothesis is
$$H_1: p > \pi_H$$
where $p$ is the proportion of sampled discrepancy values that are greater than $x_H$ in a sample of size $n$. Table 8 shows the proportion of times the null hypothesis is rejected (for $\alpha = 0.05$) when $M = 5000$ samples are taken, for several metric tolerances (0.01, 0.05, 0.10, 0.15, 0.20 and 0.5 m), using the discrepancies between the DEM05 and DEM02. We observe that, in all cases, the test based on the FMM performs better than the test based on the 1NDM. This means that, for the FMM, the rejection rate when $H_0$ is true (type I error) is closer to the desired value (0.05). This does not occur for the lowest tolerance considered when the sample size is small, but it does for the rest of the cases. In the case of the 1NDM, the values are usually less than 5%, which indicates that its statistical behavior is not as expected. This generates uncertainty about its applicability, as it does not produce the stated level of rejection; it behaves more laxly than expected.
The results in Table 8 clearly indicate that the FMM performs statistically much better than the 1NDM. The extreme cases are relevant. For very small discrepancy tolerances (0.01 m) and small sample sizes, the two approaches (FMM and 1NDM) offer high rejection levels for the sample sizes usually recommended in PAAMs (in the order of 20 elements). However, when the sample size is large (200 or 500), the FMM offers rejection values close to the established level of significance. For tolerances of very large discrepancies (1 m), the 1NDM presents a very high level of rejection. For intermediate tolerance values, the FMM adjusts its rejection level to the established significance value (0.05), while the 1NDM generates practically no rejections. Altogether, this means that the 1NDM does not work adequately as a statistical model for this case, generating both underestimation and overestimation of the producer's risk (type I error).
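A proportion test of this kind can be sketched as follows. The exceedance probability $\pi_H$ is computed from the distribution function of the model, and the decision uses an exact one-sided binomial tail. The exact binomial rule is an assumption made for this illustration (the text only states that the test is on proportions), and all function names are hypothetical.

```python
import math


def normal_cdf(x, mu, sigma):
    """Distribution function of N(mu, sigma)."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def exceedance_probability(x_h, weights, means, sigmas):
    """pi_H = P[|X| > x_h] under a Gaussian mixture (a single normal is the case g = 1)."""
    def cdf(x):
        return sum(w * normal_cdf(x, m, s) for w, m, s in zip(weights, means, sigmas))
    return 1.0 - (cdf(x_h) - cdf(-x_h))


def binomial_upper_tail(k, n, p):
    """P[B >= k] for B ~ Binomial(n, p)."""
    return sum(math.comb(n, j) * p ** j * (1.0 - p) ** (n - j) for j in range(k, n + 1))


def proportion_test_rejects(sample, x_h, pi_h, alpha=0.05):
    """Reject H0: p <= pi_h when the count of exceedances is improbably large."""
    exceed = sum(1 for d in sample if abs(d) > x_h)
    return binomial_upper_tail(exceed, len(sample), pi_h) <= alpha


# Sanity check with a single standard normal: P[|X| > 1.96] is about 0.05.
pi_normal = exceedance_probability(1.96, [1.0], [0.0], [1.0])
```

The only difference between the FMM and the 1NDM versions of the test is the value of $\pi_H$ passed to the decision rule, which is why a poor model of the tails translates directly into a wrong rejection rate.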

5.4.2. Engineering Map Accuracy Standard

The EMAS consists of two independent statistical hypothesis tests: the first one is for the mean and the second one for the variance (Appendix B). The global null hypothesis is rejected if it is rejected in either of them (that is, if a test statistic exceeds the corresponding quantile). To compare the 1NDM and the FMM, $M = 5000$ simulations have been carried out using the discrepancies between the DEM05 and DEM02, and both tests (mean and variance) have been performed.
In relation to the mean test, the null hypothesis is $H_0: \mu = \mu_H$, where $\mu_H$ is the model mean, and two situations must be considered for the alternative: for $\alpha = 0.01, 0.05, 0.1$ it is $H_1: \mu < \mu_H$, whereas for $\alpha = 0.9, 0.95, 0.99$ it is $H_1: \mu > \mu_H$. For the test based on the FMM, the mean value is compared with the corresponding quantile in Table 6; for the 1NDM, the usual Student’s t-test has been applied. Table 9 shows the proportion of times in which the null hypothesis is rejected, both for the FMM and the 1NDM.
As in the case of the previous simulations, the results are better the closer they are to the significance values considered (0.01, 0.05, ...). The results presented in Table 9 do not indicate a significant difference between the two methods.
In relation to the variance test, the same simulation procedure has been carried out. The null hypothesis is $H_0: \sigma^2 = \sigma_H^2$. Now, for the FMM, the null hypothesis is rejected when the test statistic is less than (for $\alpha = 0.01, 0.05, 0.1$; $H_1: \sigma^2 < \sigma_H^2$) or greater than (for $\alpha = 0.9, 0.95, 0.99$; $H_1: \sigma^2 > \sigma_H^2$) the corresponding value in Table 7, whereas in the 1NDM case, the test statistic is $\chi^2 = (n-1)S^2/\sigma_H^2$ and the critical value is obtained using the $\chi^2$ distribution with $(n-1)$ degrees of freedom. The results appear in Table 10.
The results of this simulation are clear in all cases: those based on the FMM are better than those based on the 1NDM, in the sense that the proportion of rejections of $H_0$ is quite similar to the expected probability in the case of the FMM and very different in the case of the 1NDM. Here, the 1NDM rejects many cases, which excessively increases the producer’s risk.
Finally, the EMAS requires passing the two tests (mean and variance) together (logical AND condition), which means that the EMAS is rejected if either of the two tests is rejected. Although the EMAS assumes a 1NDM as the underlying distribution, the same philosophy can be applied in the case of the FMM. Table 11 shows the proportion of times that the null hypothesis (which in this case is true) is rejected. One consideration about the EMAS is that it does not contemplate any correction, such as Bonferroni’s, for the fact of combining two independent hypothesis tests simultaneously. Following the suggestion in [41], we introduce this correction. For instance, when the desired global significance is $\alpha = 0.10$, Bonferroni’s correction assigns $\alpha/2 = 0.05$ to each test; in the case of the bilateral mean test, this level is split again between the two sides, so the level to be considered on each side is $\alpha/4 = 0.025$. Table 11 shows the results with and without this correction when applying the FMM and the 1NDM. It may be observed that in both cases Bonferroni’s correction provides a better result, in the sense that the proportion of times the null hypothesis is rejected is nearer to the desired value ($\alpha$). The results of the application of the 1NDM are worse than those obtained from the FMM.
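As a worked instance of the arithmetic behind this correction (a sketch only; the symbols $\alpha_{\text{test}}$ and $\alpha_{\text{side}}$ are introduced here for illustration and do not appear in the EMAS):

$$\alpha_{\text{test}} = \frac{\alpha}{2} = \frac{0.10}{2} = 0.05, \qquad \alpha_{\text{side}} = \frac{\alpha_{\text{test}}}{2} = \frac{\alpha}{4} = 0.025$$

By the union bound, the probability of rejecting at least one of the two tests when $H_0$ is true is then at most $\alpha_{\text{test}} + \alpha_{\text{test}} = \alpha$. If the two tests are independent and each attains exactly its level, that probability is $1 - (1 - 0.05)^2 = 0.0975$, slightly below the global 0.10.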

5.4.3. National Standard for Spatial Data Accuracy

The NSSDA follows a different philosophy from that of the NMAS or EMAS (Appendix C). The NSSDA does not propose a statistical hypothesis test. Instead, the value corresponding to the 95% quantile is estimated (e.g., 5.25 m at 95% confidence). This result is offered to the interested party (the user), who, based on the estimate, has to decide whether or not the data product is suitable for the intended use (fitness for use). Therefore, a value is generated and the user implicitly performs an accept/reject process, but not within a statistical acceptance/rejection framework.
From a statistical point of view, a key aspect of this standard is the behavior of the quantile estimate with respect to the sample size, since the estimate can deviate from the theoretical value. This quantile can then be typified in order to compare it with the 1.96 parameter used in the NSSDA as the expansion factor for the 95% confidence interval when a 1NDM is assumed. Applying the simulation process described above (Section 4), it is possible to compare the results derived from the three approaches under consideration (Table 12). In this table, the mean of the 97.5% quantiles of the samples is presented for each value of $n$ (column $\mu_{Q_{0.975}}$). Notice that the trend shown by these results is in accordance with that of [41], obtained for the 2D case.
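The sampling experiment behind such a column can be sketched as follows. This is a minimal illustration, not the authors' code: it draws samples from the seven-component mixture of Table 2, takes the 97.5% sample quantile of each, and typifies the mean of those quantiles with the population mean and standard deviation quoted in the text (0.00062 and 0.41835). The function names, the number of replicates `m`, the seed, and the interpolation rule for the sample quantile (`method="inclusive"`) are assumptions made here.

```python
import random
import statistics

# (mu, sigma, weight) of the seven components, copied from Table 2.
COMPONENTS = [
    (-7.78135, 10.22195, 0.00025),
    (-0.01837, 0.26977, 0.18361),
    (-0.08378, 0.05688, 0.08837),
    (0.06209, 0.51793, 0.16441),
    (-0.02414, 0.13835, 0.52425),
    (0.32596, 0.94185, 0.03558),
    (1.19120, 2.59239, 0.00353),
]
MU_POP = 0.00062     # population mean quoted in the text
SIGMA_POP = 0.41835  # population standard deviation quoted in the text


def draw_sample(rng, n):
    """Draw n discrepancies: pick a component by weight, then a normal value."""
    weights = [w for _, _, w in COMPONENTS]
    picks = rng.choices(COMPONENTS, weights=weights, k=n)
    return [rng.gauss(mu, sigma) for mu, sigma, _ in picks]


def typified_mean_quantile(n, m=2000, seed=1):
    """Return (mean of the m sample 97.5% quantiles, its typified value)."""
    rng = random.Random(seed)
    quantiles = []
    for _ in range(m):
        sample = draw_sample(rng, n)
        # 40 equal-probability groups give 39 cut points; the last is 97.5%.
        cuts = statistics.quantiles(sample, n=40, method="inclusive")
        quantiles.append(cuts[-1])
    mean_q = statistics.fmean(quantiles)
    return mean_q, (mean_q - MU_POP) / SIGMA_POP


if __name__ == "__main__":
    for size in (20, 50, 100):
        mean_q, z_q = typified_mean_quantile(size)
        print(f"n={size}: mean Q0.975 = {mean_q:.3f}, typified = {z_q:.3f}")
```

The typified values produced this way depend on the quantile definition and on the number of replicates, so they should be read as indicative only and not as a reproduction of Table 12.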
In order to have a reference for comparing the results of Table 12, the asymptotic case ($n \to \infty$) is used. For instance, in the case of the 1NDM, the constant that multiplies the value of $MSE_z$ is the typified value of the 97.5% quantile, 1.96, which is derived as follows:
$$K_{97.5}(1NDM) = \frac{Q_{97.5}(1NDM) - \mu_{1NDM}}{\sigma_{1NDM}} = \frac{0.82057 - 0.00062}{0.41835} = 1.960$$
The same computation for the FMM gives:
$$K_{97.5}(FMM) = \frac{Q_{97.5}(FMM) - \mu_{FMM}}{\sigma_{FMM}} = \frac{0.81407 - 0.00062}{0.41835} = 1.944$$
Consequently, when $n \to \infty$, in the proposed version of the NSSDA based on the FMM, the value 1.96 is replaced by 1.944. This implies that the limit value for $MSE_z$ will be slightly less than that obtained for the 1NDM case. Note that, with this calculation procedure, the NSSDA standard can be proposed for different quantiles, not only the 97.5% quantile used by the rule. Turning to the results of the simulations, the columns $Z_{Q_{0.975}}$ in Table 12 show the same pattern for all sample sizes and approaches: the typified values obtained by the simulation process are less than the corresponding population value ($n \to \infty$). This indicates an underestimation of this parameter, in accordance with previous results [42]. This underestimation of the expansion factor leads to underestimating the value corresponding to 95%; that is, the sample yields a lower level of positional error than actually exists in the population, which is a risk for the user. The differences between these values and the theoretical ones ($n \to \infty$) are of the same order of magnitude for the three models (EDOD, FMM and 1NDM). However, the 1NDM presents less discrepancy for small sample sizes, and the EDOD and FMM present less discrepancy for the larger sample sizes.
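The expansion factor for the FMM can be approximated directly from the fitted parameters. The sketch below is illustrative only: it takes the rounded component parameters of Table 2, obtains the mixture mean and standard deviation from the usual moment formulas, finds the quantile by bisection on the mixture distribution function, and typifies it as in the expressions above. The function names, the bisection bounds and the tolerance are assumptions made here; because the inputs are rounded, the outputs should land close to, but not exactly on, the values quoted in the text.

```python
import math

# (mu, sigma, weight) of the seven components, copied from Table 2.
COMPONENTS = [
    (-7.78135, 10.22195, 0.00025),
    (-0.01837, 0.26977, 0.18361),
    (-0.08378, 0.05688, 0.08837),
    (0.06209, 0.51793, 0.16441),
    (-0.02414, 0.13835, 0.52425),
    (0.32596, 0.94185, 0.03558),
    (1.19120, 2.59239, 0.00353),
]


def normal_cdf(x, mu, sigma):
    """Distribution function of a normal variable, via the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def mixture_cdf(x):
    """Weighted sum of the component distribution functions."""
    return sum(w * normal_cdf(x, mu, sigma) for mu, sigma, w in COMPONENTS)


def mixture_mean_sd():
    """Mean and standard deviation of the mixture from its first two moments."""
    mean = sum(w * mu for mu, _, w in COMPONENTS)
    second_moment = sum(w * (sigma**2 + mu**2) for mu, sigma, w in COMPONENTS)
    return mean, math.sqrt(second_moment - mean**2)


def mixture_quantile(p, lo=-100.0, hi=100.0, tol=1e-10):
    """Quantile of order p by bisection (the mixture CDF is increasing)."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if mixture_cdf(mid) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def expansion_factor(p):
    """Typified quantile K_p = (Q_p - mu) / sigma for the mixture."""
    mean, sd = mixture_mean_sd()
    return (mixture_quantile(p) - mean) / sd


if __name__ == "__main__":
    print("mean, sd:", mixture_mean_sd())
    print("Q97.5:", mixture_quantile(0.975))
    print("K97.5:", expansion_factor(0.975))
```

Calling `expansion_factor` with another order, for example `expansion_factor(0.95)`, gives the corresponding factor for a different quantile, which is the generalization mentioned in the paragraph above.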

6. Discussion

In relation to FMMs, we can highlight that they are fully developed statistical tools that are widely applied in other fields; however, we do not know of any application to spatial data, and even less to the subject of positional accuracy. The application of an FMM is not complex, as has been shown in this work; in addition, to present a simpler case, we have only worked in 1D (elevation discrepancies). However, the model is directly applicable to 2D and 3D cases if the coordinates and their associated errors are considered independently. Since the tools to fit the model exist, and the selection criteria are common (e.g., AIC, BIC), the most critical aspect is the sample size needed for a good fit. This size depends heavily on the data to be fitted (informational structure); thus, it is not possible to offer quantitative recommendations. Obviously, the larger the sample size, the more accurate the estimation, especially if the hypothesis of a mixture is true. As a first idea, the sample size should be as large as possible, but an important limitation is the cost of obtaining the sample.
In any case, it is best to proceed with empirical testing; for instance, by means of simulation, we found that sample sizes greater than 2000 produce acceptable results regarding the distance between the obtained model (the fitted FMM) and the real data (the EDOD). An interesting aspect that has not been explored in this work is that, once the FMM has been obtained, its results may have other applications. For example, the estimated model provides a grouping that is intrinsic to the data and that, unlike cluster analysis, does not need additional explanatory variables, since it is produced by ascribing each discrepancy case to the mixing distribution to which it is most likely to belong. One can also try to interpret these groups using multivariate statistical techniques such as discriminant analysis, logistic regression, etc. In addition, if other variables are available (e.g., slope, aspect, type of terrain and so on), they can help to better understand the nature of the mixing distributions (see, for instance, [31,32,43,44,45]). The BIC criterion has led us to select a model with seven components. This model offers a majority component (the fifth component, with 52% of the weight), three components with weights between 5% and 20%, and other very minor components, two of them linked to extreme values (atypical/outlier values in the 1NDM case). We do not know whether a model with fewer components would work almost as well as this seven-component model; however, this is not really a problem, because once it is decided to use an FMM-type fit, its dimension (number of components) is easily managed by means of any statistical tool. For this reason, we consider that following a selection criterion based on the BIC offers the same solution, impartial and objective, to anyone who performs the same process on the same data, which allows the method to be standardized.
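The grouping mentioned above, in which each discrepancy is ascribed to the mixing distribution to which it most likely belongs, can be sketched with Bayes' rule. This is a minimal illustration using the rounded parameters of Table 2; the function names and the example values are hypothetical and are not part of the original study.

```python
import math

# (mu, sigma, weight) of the seven components, copied from Table 2.
COMPONENTS = [
    (-7.78135, 10.22195, 0.00025),
    (-0.01837, 0.26977, 0.18361),
    (-0.08378, 0.05688, 0.08837),
    (0.06209, 0.51793, 0.16441),
    (-0.02414, 0.13835, 0.52425),
    (0.32596, 0.94185, 0.03558),
    (1.19120, 2.59239, 0.00353),
]


def normal_pdf(x, mu, sigma):
    """Density of a normal variable with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def posterior_probabilities(x):
    """Posterior probability of each component given the discrepancy x."""
    joint = [w * normal_pdf(x, mu, sigma) for mu, sigma, w in COMPONENTS]
    total = sum(joint)
    return [j / total for j in joint]


def most_likely_component(x):
    """1-based index (as in Table 2) of the component with highest posterior."""
    post = posterior_probabilities(x)
    return 1 + max(range(len(post)), key=post.__getitem__)


if __name__ == "__main__":
    for value in (-15.0, -0.02, 0.3, 5.0):  # hypothetical discrepancies [m]
        print(value, most_likely_component(value))
```

A hard assignment of this kind discards the uncertainty in the posterior probabilities; keeping the full vector returned by `posterior_probabilities` may be preferable when the groups are later related to covariates such as slope or land cover.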
In relation to the discrepancy values used in this paper, the analysis comparing the results of the FMM with the 1NDM and the observed data (EDOD) clearly shows that the FMM offers results that are much more consistent with the real population than those of the 1NDM. Thus, the difference in values between the EDOD and the FMM is very small in all the cases presented in Table 3, Table 4 and Table 12. Moreover, if the 1NDM is compared with the FMM and the EDOD, it can be observed that the difference in quantile distance has reached 23% in the case of the 90% quantile (Table 3). In the case of probabilities (Table 4), the probability difference between the 1NDM, on the one hand, and the FMM and the EDOD, on the other, in the analyzed intervals has reached 0.3 (case $X > 0.2$), which means a discrepancy of 30%. The above two examples are cases of maximum difference, but on average the difference is also considerable. This clearly demonstrates that the 1NDM is not suitable for modeling data such as those used in this paper.
Finally, we turn to the results obtained when considering commonly used standards for positional accuracy assessment. In this case, the most important aspect is the adjustment to the level of significance, as this is the producer’s risk assumed in a statistical control process. As shown by Table 8 for the NMAS, the FMM performs statistically much better than the 1NDM when considering all the tolerance values and sample sizes used in the analysis. In the case of the 1NDM, the values are usually less than 5%, which indicates that its statistical behavior is not “as expected”. Table 11 presents the main results for the case of the EMAS. The first conclusion is the need for a Bonferroni correction when applying the EMAS. For both significance levels (0.05 and 0.1), the rejection level of the FMM is a little less than the prescribed level; the differences are in the order [2.1, 1.5]% (always less). The contrary occurs for the 1NDM; the differences are in the order [5.6, 7.4]% (always more). We consider that these differences with respect to the prescribed value are really high. In this case, there is an excess of rejection that harms the producer, with the consequent problems that this can also generate for the user. The NSSDA is not a statistical test, although it can be understood to involve a process of acceptance/rejection by the user, who must decide whether or not the result of the estimation seems adequate for the application at hand. If we consider that this process is based on the simple comparison of values (estimated from the sample versus the theoretical), Table 12 indicates acceptance for all sample sizes and models, and a very similar behavior of the three approaches under consideration.

7. Conclusions

We consider that statistical models based on finite mixtures of normal distributions allow a better approximation to actual altimetric errors, as shown by their ability to fit the observed data. The method and the tools for the application of this alternative are already developed, and its application is quite direct. The main limitation of the use of FMMs is the need for large sample sizes to fit the parameters of the mixing distributions. Furthermore, no simple rule can be offered to establish this size. For the application phase of the FMMs using PAAMs, larger sample sizes will be needed, but, in any case, in the order of the previous recommendations for these standards.
The use of FMMs as the statistical models for the application of the PAAMs analyzed (NMAS, EMAS and NSSDA) improves the behavior of the results for those standards based on statistical hypothesis tests (e.g., NMAS and EMAS). In this case, the application of FMMs offers results with a better approximation to the levels of significance. If the PAAM is not based on a statistical process, as analyzed here for the NSSDA, the FMM does not have such a clear advantage.
Since the FMM is a statistical model obtained from the numerical values of the errors, it does not necessarily have to be associated, a priori, with an underlying physical model of the terrain. Therefore, it can be considered a black box system, which is common for PAAMs of this type. However, a posteriori, the FMM could be used to analyze the spatial distribution of the mixing distributions in order to obtain a more ground-based interpretation of the error distribution and of the reason for the allocation of each error to a component of the FMM. We believe that this could be of great interest if some relationship is found with variables that have traditionally been considered to explain the altimetric error (e.g., slope, vegetation cover). We consider this to be a future line of research that could help establish the use of FMMs for DEM error assessment and analysis.
In this paper, the application has been developed for the case of 1D errors, and for this reason we worked with DEMs, but the method is directly applicable to the case of 2D errors, if the X and Y components are considered independently. Let us bear in mind that the proposed method provides a parametric statistical model, which, once estimated, allows us to work through population values. Therefore, its use is not limited to the case of altimetry errors, which is what has been developed here; it is also useful for obtaining probabilistic models in any set of quantitative measurements, such as slopes or the values of heights themselves. This would allow them to be used, for example, to compare between different areas, or even in the same area in different periods of time. Likewise, knowledge of the theoretical model allows its use when proposing more precise and exact contrasts appropriate to the nature of the data.

Author Contributions

Data curation, J.R.-A. and F.J.A.-L.; Formal analysis, J.R.-A. and F.J.A.-L.; Funding acquisition, F.J.A.-L.; Investigation, J.R.-A. and F.J.A.-L.; Methodology, J.R.-A. and F.J.A.-L.; Project administration, F.J.A.-L.; Software, J.R.-A.; Writing–original draft, J.R.-A. and F.J.A.-L.; Writing—review & editing, J.R.-A. All authors have read and agreed to the published version of the manuscript.

Funding

This work has been partially financed by the research project “Functional Quality of Digital Elevation Models in Engineering” of the State Agency Research of Spain. PID2019-106195RB-I00 /AEI/10.13039/501100011033 (https://coello.ujaen.es/investigacion/web_giic/funquality4dem/) (accessed on 30 March 2022).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data sets are freely available on the webpage, http://www.ign.es, accessed on 30 March 2022.

Conflicts of Interest

The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.

Appendix A. NMAS

1. Select a sample.
2. Calculate the error of each point in each component:
$$e_{xi} = x_{pi} - x_i, \quad e_{yi} = y_{pi} - y_i, \quad e_{zi} = z_{pi} - z_i$$
where:
  • $x_i, y_i, z_i$ are the coordinates in the reference (RDS).
  • $x_{pi}, y_{pi}, z_{pi}$ are the coordinates in the product (ADS).
3. Calculate the horizontal component of the errors in x, y at each point:
$$e_{Hi} = \sqrt{e_{xi}^2 + e_{yi}^2}$$
4. Establish the maximum tolerable errors:
  • Horizontal: HTol1 = 0.085 cm (1/30 inch) in maps of a scale greater than E20K or HTol2 = 0.05 cm (1/50 inch) in maps at a scale smaller than or equal to E20K.
  • Vertical: Half of the equidistance (interval) between contour lines (VTol).
5. Count how many points have a horizontal error $e_H$ greater than the tolerance that applies to the scale case. The control is passed in the horizontal component if the number of points having an error above the tolerance does not exceed 10% of the cases.
6. Count how many points have a vertical error $e_z$ greater than the vertical tolerance. The control is passed in the vertical component if the number of points that have an error above the tolerance does not exceed 10% of the cases.
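The counting rule in the steps above can be sketched in a few lines. This is an illustrative helper, not part of the standard: the function names are hypothetical, the errors and the tolerance are assumed to be expressed in the same units, and the vertical error is compared in absolute value, which is an assumption made here.

```python
import math


def horizontal_errors(ex, ey):
    """Horizontal error at each point from its x and y error components."""
    return [math.hypot(a, b) for a, b in zip(ex, ey)]


def exceed_count(errors, tolerance):
    """Number of points whose error magnitude is above the tolerance."""
    return sum(1 for e in errors if abs(e) > tolerance)


def passes_nmas(errors, tolerance, max_percent=10):
    """True if at most max_percent of the points exceed the tolerance."""
    n = len(errors)
    if n == 0:
        raise ValueError("the sample is empty")
    # Integer comparison avoids floating-point issues at exactly 10%.
    return exceed_count(errors, tolerance) * 100 <= max_percent * n


if __name__ == "__main__":
    ez = [0.1] * 18 + [0.9, -0.9]   # hypothetical vertical errors [m]
    print(passes_nmas(ez, tolerance=0.5))
```

For the horizontal component, the same check would be applied to the output of `horizontal_errors` with the tolerance that corresponds to the map scale.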

Appendix B. EMAS

1. Select a sample of $n$ points, where $n \geq 20$.
2. Calculate the error for each point in each component:
$$e_{xi} = x_{pi} - x_i, \quad e_{yi} = y_{pi} - y_i, \quad e_{zi} = z_{pi} - z_i$$
where:
  • $x_i, y_i, z_i$ are the coordinates in the reference (RDS).
  • $x_{pi}, y_{pi}, z_{pi}$ are the coordinates in the product (ADS).
3. Calculate the mean error of each component:
$$\bar{e}_x = \frac{1}{n}\sum_{i=1}^{n} e_{xi}; \quad \bar{e}_y = \frac{1}{n}\sum_{i=1}^{n} e_{yi}; \quad \bar{e}_z = \frac{1}{n}\sum_{i=1}^{n} e_{zi}$$
4. Calculate the sample standard deviation in each component:
$$S_x = \sqrt{\frac{\sum (e_{xi} - \bar{e}_x)^2}{n-1}}; \quad S_y = \sqrt{\frac{\sum (e_{yi} - \bar{e}_y)^2}{n-1}}; \quad S_z = \sqrt{\frac{\sum (e_{zi} - \bar{e}_z)^2}{n-1}}$$
5. Perform, for each component, the standard compliance test to determine whether the mean error is acceptable (which implies an absence of bias). For this, a test is performed on the mean, under the assumption of unknown population variance and establishing the following hypotheses:
$$H_0: \mu = 0; \quad H_1: \mu \neq 0$$
The map will pass the test with a significance level $\alpha$ if the following is met:
$$|t_x| \leq t_{n-1,\alpha/2}; \quad |t_y| \leq t_{n-1,\alpha/2}; \quad |t_z| \leq t_{n-1,\alpha/2}$$
where:
  • $t_{n-1,\alpha/2}$: Value of Student’s t-distribution with $n-1$ degrees of freedom.
  • $t_x, t_y, t_z$: Result of calculating the following statistics:
    $$t_x = \frac{\sqrt{n}\,\bar{e}_x}{S_x}; \quad t_y = \frac{\sqrt{n}\,\bar{e}_y}{S_y}; \quad t_z = \frac{\sqrt{n}\,\bar{e}_z}{S_z}$$
6. Perform, for each component, the standard compliance test to determine if the sample standard deviation is within acceptable limits. For this purpose, a test is performed on the variance, establishing the following hypotheses in relation to a maximum variance value $\sigma_{0x}^2$, $\sigma_{0y}^2$ and $\sigma_{0z}^2$ pre-established and specified for each component:
$$H_0: \sigma^2 \leq \sigma_0^2; \quad H_1: \sigma^2 > \sigma_0^2$$
The product will pass the control with a significance level $\alpha$ if the following is met:
$$\chi_x^2 \leq \chi^2_{n-1,1-\alpha}; \quad \chi_y^2 \leq \chi^2_{n-1,1-\alpha}; \quad \chi_z^2 \leq \chi^2_{n-1,1-\alpha}$$
where:
  • $\chi^2_{n-1,1-\alpha}$: Theoretical value of the Chi-square distribution with $n-1$ degrees of freedom.
  • $\chi_x^2, \chi_y^2, \chi_z^2$: Result of calculating the following statistics:
    $$\chi_x^2 = \frac{(n-1)S_x^2}{\sigma_{0x}^2}; \quad \chi_y^2 = \frac{(n-1)S_y^2}{\sigma_{0y}^2}; \quad \chi_z^2 = \frac{(n-1)S_z^2}{\sigma_{0z}^2}$$
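For a single component, the two statistics of this procedure can be computed as in the following sketch. It is illustrative only: the function names are hypothetical, and the critical values are passed in by the caller (for instance, read from statistical tables) so that no particular statistics library is assumed. The values 2.093 and 30.144 used in the example are the tabulated Student's t and Chi-square values for 19 degrees of freedom at $\alpha = 0.05$.

```python
import math
import statistics


def emas_statistics(errors, sigma0):
    """Return (t, chi2) for one error component and a maximum sigma0."""
    n = len(errors)
    if n < 20:
        raise ValueError("the procedure asks for a sample of n >= 20 points")
    mean_e = statistics.fmean(errors)
    s = statistics.stdev(errors)          # sample deviation, n - 1 denominator
    if s == 0.0:
        raise ValueError("all errors are identical; t is undefined")
    t_stat = math.sqrt(n) * mean_e / s
    chi2_stat = (n - 1) * s**2 / sigma0**2
    return t_stat, chi2_stat


def passes_emas(errors, sigma0, t_crit, chi2_crit):
    """Both conditions must hold: |t| <= t_crit and chi2 <= chi2_crit."""
    t_stat, chi2_stat = emas_statistics(errors, sigma0)
    return abs(t_stat) <= t_crit and chi2_stat <= chi2_crit


if __name__ == "__main__":
    ez = [0.1, -0.1, 0.2, -0.2] * 5       # hypothetical vertical errors [m]
    print(emas_statistics(ez, sigma0=0.2))
    print(passes_emas(ez, sigma0=0.2, t_crit=2.093, chi2_crit=30.144))
```

Under the FMM approach described in the main text, the same structure would apply, with the critical values taken from the quantiles of the fitted model in place of the Student's t and Chi-square tables.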

Appendix C. NSSDA

1. Select a sample of $n$ points, where $n \geq 20$.
2. Calculate the error for each point in each component:
$$e_{xi} = x_{pi} - x_i, \quad e_{yi} = y_{pi} - y_i, \quad e_{zi} = z_{pi} - z_i$$
where:
  • $x_i, y_i, z_i$ are the coordinates in the reference (RDS).
  • $x_{pi}, y_{pi}, z_{pi}$ are the coordinates in the product (ADS).
3. Calculate the root mean square error of each component (denoted $MSE$):
$$MSE_x = \sqrt{\frac{\sum e_{xi}^2}{n}}; \quad MSE_y = \sqrt{\frac{\sum e_{yi}^2}{n}}; \quad MSE_z = \sqrt{\frac{\sum e_{zi}^2}{n}}$$
4. Obtain the horizontal $NSSDA_H$ value:
  • if $MSE_x = MSE_y$,
    $$NSSDA_H = \frac{2.4477}{\sqrt{2}}\, MSE_r = 2.4477\, MSE_x$$
    where:
    $$MSE_r = \sqrt{MSE_x^2 + MSE_y^2}$$
  • if $MSE_x \neq MSE_y$ and $0.6 < MSE_{min}/MSE_{max} < 1.0$,
    $$NSSDA_H = 2.4477 \times 0.5 \times (MSE_x + MSE_y)$$
5. Obtain the vertical $NSSDA_z$ value according to the following expression:
$$NSSDA_z = 1.9600 \times MSE_z$$
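The computation described in this appendix can be sketched as follows. This is an illustrative helper with hypothetical function names. It assumes that the quantity the appendix denotes as MSE is the root of the mean squared error, as in the FGDC standard [21], and, for the case of equal x and y values, it uses the FGDC relation of 2.4477 times the common per-axis value. Ratios outside the range covered by the appendix raise an error and are not extrapolated.

```python
import math


def root_mse(errors):
    """Root of the mean squared error of one component."""
    return math.sqrt(sum(e * e for e in errors) / len(errors))


def nssda_vertical(ez):
    """Vertical NSSDA value: 1.9600 times the vertical root MSE."""
    return 1.9600 * root_mse(ez)


def nssda_horizontal(ex, ey):
    """Horizontal NSSDA value for the two cases covered by the appendix."""
    mx, my = root_mse(ex), root_mse(ey)
    if math.isclose(mx, my):
        # Equal components: 2.4477 times the common per-axis value.
        return 2.4477 * mx
    ratio = min(mx, my) / max(mx, my)
    if 0.6 < ratio < 1.0:
        # Unequal components within the admitted ratio: approximate formula.
        return 2.4477 * 0.5 * (mx + my)
    raise ValueError("ratio of the x and y values is outside (0.6, 1.0)")


if __name__ == "__main__":
    ex = [0.3, -0.3] * 10   # hypothetical planimetric errors [m]
    ey = [0.4, -0.4] * 10
    ez = [0.5, -0.5] * 10   # hypothetical vertical errors [m]
    print(nssda_horizontal(ex, ey), nssda_vertical(ez))
```

In the FMM-based variant discussed in the main text, the constant 1.9600 in `nssda_vertical` would be replaced by the expansion factor derived from the fitted mixture.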

References

  1. Ariza-López, F.J. Calidad en la Producción Cartográfica; RA-MA: Madrid, Spain, 2002; ISBN 978-84-7897-524-2.
  2. Church, R.; Curtin, K.; Fohl, P.; Funk, C.; Goodchild, M.; Kyriakidis, P.; Noronha, V. Positional Distortion in Geographic Data Sets as a Barrier to Interoperation. In Proceedings of the American Congress on Surveying and Mapping Annual Conference, Baltimore, MD, USA, 6–10 April 1998; Technical Papers. pp. 377–387.
  3. Ariza-López, F.J. (Ed.) Fundamentos de Evaluación de la Calidad de la Información Geográfica; Universidad de Jaén: Jaén, Spain, 2013; ISBN 978-84-8439-813-4.
  4. Maune, D.F.; Navegandhi, A. (Eds.) Digital Elevation Model Technologies and Applications: The DEM User’s Manual; American Society for Photogrammetry and Remote Sensing: Bethesda, MD, USA, 2019.
  5. Guth, P.L.; Van Niekerk, A.; Grohmann, C.H.; Muller, J.P.; Hawker, L.; Florinsky, I.V.; Gesch, D.; Reuter, H.I.; Herrera-Cruz, V.; Riazanoff, S.; et al. Digital Elevation Models: Terminology and Definitions. Remote Sens. 2021, 13, 3581.
  6. Gomez, C.; Hayakawa, Y.; Obanawa, H. A study of Japanese landscapes using structure from motion derived DSMs and DEMs based on historical aerial photographs: New opportunities for vegetation monitoring and diachronic geomorphology. Geomorphology 2015, 242, 11–20.
  7. Saksena, S.; Merwade, V. Incorporating the effect of DEM resolution and accuracy for improved flood inundation mapping. J. Hydrol. 2015, 530, 180–194.
  8. Juel, A.; Groom, G.B.; Svenning, J.C.; Ejrnaes, R. Spatial Application of Random Forest models for fine-scale coastal vegetation classification using object based analysis of aerial orthophoto and DEM data. Int. J. Appl. Earth Obs. 2015, 42, 106–114.
  9. Rekha, P.N.; Gangadharan, R.; Ravichandran, P.; Mahalakshmi, P.; Panigrahi, A.; Pillai, S.M. Assessment of impact of shrimp farming on coastal groundwater using Geographical Information System based Analytical Hierarchy Process. Aquaculture 2015, 448, 491–506.
  10. Stroeven, P.; Li, K.; Le, N.L.B.; He, H.; Stroeven, M. Capabilities for property assessment on different levels of the microstructure of DEM-simulated cementitious materials. Constr. Build. Mater. 2015, 88, 105–117.
  11. UN-GGIM. The Global Fundamental Geospatial Data Themes. 2019. Available online: https://ggim.un.org/documents/Fundamental%20Data%20Publication.pdf (accessed on 28 March 2022).
  12. EU. Directive 2007/2/EC of the European Parliament and of the Council of 14 March 2007 Establishing an Infrastructure for Spatial Information in the European Community (INSPIRE) 14.03.2007. 2007. Available online: https://eur-lex.europa.eu/legal-content/EN/ALL/?uri=CELEX%3A32007L0002 (accessed on 28 March 2022).
  13. Mesa-Mingorance, J.L.; Ariza-López, F.J. Accuracy Assessment of Digital Elevation Models (DEMs): A Critical Review of Practices of the Past Three Decades. Remote Sens. 2020, 12, 2630.
  14. Ariza-López, F.J.; Chicaiza-Mora, E.G.; Mesa-Mingorance, J.L.; Cai, J.; Reinoso-Gordo, J.F. DEMs: An Approach to Users and Uses from the Quality Perspective. Int. J. Spat. Data Infrastruct. Res. 2018, 13, 131–171, Special Section: INSPIRE (Full Research Article). Available online: https://ijsdir.sadl.kuleuven.be/index.php/ijsdir/article/download/469/430 (accessed on 28 March 2022).
  15. Wechsler, S.P. Uncertainties Associated with Digital Elevation Models for Hydrologic Applications: A Review. Hydrol. Earth Syst. Sci. 2007, 11, 1481–1500.
  16. Hengl, T.; Heuvelink, G.B.M.; Van Loon, E.E. On the Uncertainty of Stream Networks Derived from Elevation Data: The Error Propagation Approach. Hydrol. Earth Syst. Sci. 2010, 14, 1153–1165.
  17. Höhle, J.; Potuckova, M. Assessment of the Quality of Digital Terrain Models; Official Publication No. 60 of European Spatial Data Research; Gopher: Amsterdam, The Netherlands, 2011.
  18. Ariza-López, F.J.; García-Balboa, J.L.; Rodríguez-Avi, J.; Robledo, J. Guide for the Positional Accuracy Assessment of Geospatial Data. Pan American Institute of Geography and History, Occasional Publication 563. 2021. Available online: http://publicaciones.ipgh.org/publicaciones-ocasionales/Guide-for-the-positional-acuracy-assessment%20of%20geospatial-data_publ563.pdf (accessed on 28 March 2022).
  19. USBB. United States National Map Accuracy Standards; U.S. Bureau of the Budget: Washington, DC, USA, 1947.
  20. ASPRS. ASPRS Positional accuracy standards for digital geospatial data. Photogramm. Eng. Remote Sens. 2015, 81, A21–A26.
  21. FGDC. FGDC-STD-007: Geospatial Positioning Accuracy Standards, Part 3; National Standard for Spatial Data Accuracy, Federal Geographic Data Committee: Reston, VA, USA, 1998. Available online: https://www.fgdc.gov/standards/projects/accuracy/part3/chapter3 (accessed on 28 March 2022).
  22. Ariza-López, F.J.; Rodríguez-Avi, J. A statistical model inspired by the National Map Accuracy Standard. Photogramm. Eng. Remote Sens. 2014, 80, 271–281.
  23. ASCE. Map Uses, Scales and Accuracies for Engineering and Associated Purposes; American Society of Civil Engineers, Committee on Cartographic Surveying, Surveying and Mapping Division: New York, NY, USA, 1983.
  24. Zandbergen, P.A. Positional Accuracy of Spatial Data: Non-Normal Distributions and a Critique of the National Standard for Spatial Data Accuracy. Trans. GIS 2008, 12, 103–130.
  25. Zandbergen, P.A. Characterizing the error distribution of Lidar elevation data for North Carolina. Int. J. Remote Sens. 2011, 32, 409–430.
  26. Höhle, J.; Höhle, M. Accuracy assessment of digital elevation models by means of robust statistical methods. ISPRS J. Photogramm. Remote Sens. 2009, 64, 398–406.
  27. Ariza-López, F.J.; Rodríguez-Avi, J.; González-Aguilera, D.; Rodríguez-Gonzálvez, P. A New Method for Positional Accuracy Control for Non-Normal Errors Applied to Airborne Laser Scanner Data. Appl. Sci. 2019, 9, 3887.
  28. Cheok, G.; Filliben, J.; Lytle, A.M. NISTIR 7638. Guidelines for Accepting 2D Building Plans; National Institute of Standards and Technology: Gaithersburg, MD, USA, 2008.
  29. McLachlan, G.J.; Peel, D. Finite Mixture Models; Wiley Series in Probability and Statistics: New York, NY, USA, 2000.
  30. McLachlan, G.J.; Lee, S.X.; Rathnayake, S.I. Finite Mixture Models. Annu. Rev. Stat. Its Appl. 2019, 6, 355–378.
  31. Rodríguez-Avi, J.; Ariza-Lopez, F.J. Finite mixtures of normal distributions in the study of the error in altimetry. Adv. Cartogr. GIScience Int. Cartogr. Assoc. 2021, 3, 13.
  32. Rodríguez-Avi, J. A Probabilistic Model for the Distribution of GDP per Capita in NUTS 3 Zones of Europe. Stud. Appl. Econ. 2022, 40, 5326.
  33. Dempster, A.; Laird, N.; Rubin, D. Maximum likelihood from incomplete data via the EM algorithm. J. R. Stat. Soc. Ser. B (Methodol.) 1977, 39, 1–38.
  34. McLachlan, G.J.; Krishnan, T. The EM Algorithm and Extensions, 2nd ed.; John Wiley and Sons, Inc.: Hoboken, NJ, USA, 2008.
  35. Cueva-López, V.; Olmo-Jiménez, M.J.; Rodríguez-Avi, J. EM algorithm for an extension of the Waring distribution. Comput. Math. Methods 2019, 1, e1046.
  36. Cameron, A.C.; Trivedi, P.K. Regression Analysis of Count Data, 2nd ed.; Cambridge University Press: New York, NY, USA, 2013.
  37. Burnham, K.P.; Anderson, D.R. Model Selection and Multimodel Inference: A Practical Information-Theoretic Approach; Springer Science & Business Media: New York, NY, USA, 2003.
  38. R Core Team. R: A Language and Environment for Statistical Computing; R Foundation for Statistical Computing: Vienna, Austria, 2021; Available online: https://www.R-project.org/ (accessed on 28 March 2022).
  39. Benaglia, T.; Chauveau, D.; Hunter, D.R.; Young, D. mixtools: An R Package for Analyzing Finite Mixture Models. J. Stat. Softw. 2009, 32, 1–29.
  40. IPGH. Instituto Panamericano de Geografia e Historia: Especificaciones para Mapas Topográficos; Instituto Panamericano de Geografía e Historia (IPGH): Panamá City, Panama, 1978.
  41. Ariza-López, F.J.; Atkinson, A.D.J.; Rodríguez-Avi, J. Acceptance curves for the positional control of geographic data bases. J. Surv. Eng. 2008, 134, 26–32.
  42. Ariza López, F.J.; Atkinson, A.D.J. Variability of NSSDA Estimations. J. Surv. Eng. 2008, 134, 39–44.
  43. Rodríguez-Avi, J. Caracterización del error en MDE por mixtura de distribuciones. Rev. Cart. 2021, 103, 123–143.
  44. Guanchun, L.; Chien-Chiang, L.; Yuanyuan, L. Growth path heterogeneity across provincial economies in China: The role of geography versus institutions. Empir. Econ. 2020, 59, 503–546.
  45. Pani, A.; Sahu, P.K.; Majumdar, B.B. Expenditure-based segmentation of freight travel markets: Identifying the determinants of freight transport expenditure for developing marketing strategies. Res. Transp. Bus. Manag. 2019, 33, 100437.
Figure 1. The simulation process for the comparison of the analyzed positional accuracy assessment approaches.
Figure 1. The simulation process for the comparison of the analyzed positional accuracy assessment approaches.
Remotesensing 14 02062 g001
Figure 2. Map of the zone of Allo in Navarra (Spain).
Figure 2. Map of the zone of Allo in Navarra (Spain).
Remotesensing 14 02062 g002
Figure 3. Discrepancies model (error model) of the Allo zone (DEM05-DEM02).
Figure 3. Discrepancies model (error model) of the Allo zone (DEM05-DEM02).
Remotesensing 14 02062 g003
Figure 4. Histogram of the discrepancies model (error model) of the Allo zone (DEM05-DEM02).
Figure 4. Histogram of the discrepancies model (error model) of the Allo zone (DEM05-DEM02).
Remotesensing 14 02062 g004
Figure 5. Trimmed histogram of the discrepancies model (error model) of the Allo zone (DEM05-DEM02).
Figure 5. Trimmed histogram of the discrepancies model (error model) of the Allo zone (DEM05-DEM02).
Remotesensing 14 02062 g005
Figure 6. Normal QQ-plot of the discrepancies model (error model) of the Allo zone (DEM05-DEM02).
Figure 7. Observed histogram (EDOD) and density function derived from the finite mixture model (FMM) for the discrepancies (colored curve).
Table 1. Values of AIC and BIC for g from 2 to 10 for the estimation of the finite mixture model.

| g | AIC | BIC | Iterations |
|---|---|---|---|
| 2 | 125,846.6 | 125,913.2 | 48 |
| 3 | 81,894.5 | 81,994.5 | 168 |
| 4 | 77,244.1 | 77,377.4 | 1059 |
| 5 | 74,966.8 | 75,133.5 | 1923 |
| 6 | 74,290.3 | 74,490.2 | 4757 |
| 7 | 74,173.8 | 74,407.1 | 13,332 |
| 8 | 74,179.8 | 74,446.4 | 115,186 |
| 9 | 74,166.2 | 74,466.2 | 461,837 |
| 10 | 74,167.9 | 74,501.2 | 400,682 |

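As a minimal sketch of how the values in Table 1 can be read for model selection, the snippet below picks the number of components g with the smallest criterion value. The "smaller is better" rule is the usual convention for AIC and BIC and is assumed here; the dictionary and function names are illustrative and do not come from the article.

```python
# AIC and BIC values copied from Table 1, keyed by the number of components g.
AIC = {2: 125846.6, 3: 81894.5, 4: 77244.1, 5: 74966.8, 6: 74290.3,
       7: 74173.8, 8: 74179.8, 9: 74166.2, 10: 74167.9}
BIC = {2: 125913.2, 3: 81994.5, 4: 77377.4, 5: 75133.5, 6: 74490.2,
       7: 74407.1, 8: 74446.4, 9: 74466.2, 10: 74501.2}


def best_g(criterion):
    """Return the g with the smallest criterion value (assumed selection rule)."""
    return min(criterion, key=criterion.get)


print("g minimizing AIC:", best_g(AIC))
print("g minimizing BIC:", best_g(BIC))
```

Under this rule BIC is smallest at g = 7, which agrees with the seven-component model reported in Table 2, whereas AIC is smallest at g = 9.
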
Table 2. Estimated parameters for each component of the finite mixture model based on 7 normal distributions.

| Component | $\hat{\mu}_i$ | $\hat{\sigma}_i$ | $\hat{\pi}_i$ |
|---|---|---|---|
| 1 | −7.78135 | 10.22195 | 0.00025 |
| 2 | −0.01837 | 0.26977 | 0.18361 |
| 3 | −0.08378 | 0.05688 | 0.08837 |
| 4 | 0.06209 | 0.51793 | 0.16441 |
| 5 | −0.02414 | 0.13835 | 0.52425 |
| 6 | 0.32596 | 0.94185 | 0.03558 |
| 7 | 1.19120 | 2.59239 | 0.00353 |

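The parameters in Table 2 define the mixture distribution function as a weighted sum of normal distribution functions, F(x) = Σ π_i Φ((x − μ_i)/σ_i). The following sketch, which uses only the Python standard library, evaluates that sum and inverts it by bisection so that quantiles and interval probabilities can be compared with those listed for the FMM. The function names, the search bracket and the tolerance are assumptions made for illustration, not the authors' implementation.

```python
import math

# (mu, sigma, weight) for each component, copied from Table 2.
COMPONENTS = [
    (-7.78135, 10.22195, 0.00025),
    (-0.01837, 0.26977, 0.18361),
    (-0.08378, 0.05688, 0.08837),
    (0.06209, 0.51793, 0.16441),
    (-0.02414, 0.13835, 0.52425),
    (0.32596, 0.94185, 0.03558),
    (1.19120, 2.59239, 0.00353),
]


def normal_cdf(x, mu, sigma):
    """Distribution function of a normal N(mu, sigma^2)."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def fmm_cdf(x):
    """Mixture distribution function: weighted sum of the component CDFs."""
    return sum(w * normal_cdf(x, mu, s) for mu, s, w in COMPONENTS)


def fmm_quantile(p, lo=-100.0, hi=100.0, tol=1e-10):
    """Invert fmm_cdf by bisection; the bracket [lo, hi] is an assumption."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if fmm_cdf(mid) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


for p in (0.025, 0.5, 0.975):
    print(f"{p:.3f} quantile: {fmm_quantile(p):.5f}")
print(f"P(|X| > 0.5) = {1.0 - (fmm_cdf(0.5) - fmm_cdf(-0.5)):.5f}")
```

Bisection is used because the mixture distribution function is monotone but has no closed-form inverse; any bracketing root finder would serve equally well.
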
Table 3. Comparison of quantiles for the Finite Mixture Model (FMM), the One Normal Distribution Model (1NDM) and the empirical distribution of observed data (EDOD).

| Quantile | EDOD | FMM | 1NDM |
|---|---|---|---|
| 2.5% | −0.61349 | −0.61378 | −0.82057 |
| 5% | −0.42628 | −0.42648 | −0.68751 |
| 10% | −0.27963 | −0.27943 | −0.53553 |
| 25% | −0.13934 | −0.13953 | −0.28157 |
| 50% | −0.02975 | −0.02980 | 0.00062 |
| 75% | 0.10638 | 0.10620 | 0.28279 |
| 90% | 0.30169 | 0.30120 | 0.53676 |
| 95% | 0.53619 | 0.53678 | 0.68875 |
| 97.5% | 0.81286 | 0.81407 | 0.82057 |

Table 4. Example of probabilities for several discrepancy intervals using the Finite Mixture Model (FMM), the One Normal Distribution Model (1NDM) and the empirical distribution of observed data (EDOD).

| Interval (m) | EDOD | FMM | 1NDM |
|---|---|---|---|
| X < −0.5 | 0.03778 | 0.03767 | 0.11572 |
| X < −1 | 0.00705 | 0.00706 | 0.00838 |
| 0.5 < X < 0.8 | 0.02919 | 0.02927 | 0.00883 |
| X > 0.5 | 0.05502 | 0.05513 | 0.11623 |
| X > 0.41835 | 0.06890 | 0.06908 | 0.15901 |
| \|X\| > 0.01 | 0.95807 | 0.95781 | 0.98093 |
| \|X\| > 0.05 | 0.78908 | 0.78964 | 0.90487 |
| \|X\| > 0.10 | 0.40906 | 0.40939 | 0.18892 |
| \|X\| > 0.20 | 0.31594 | 0.31600 | 0.63260 |
| \|X\| > 0.50 | 0.09280 | 0.09280 | 0.23202 |
| \|X\| > 1 | 0.02305 | 0.02319 | 0.01683 |

Table 5. Mean, standard error of the mean, estimated standard deviation of the population, mean of variances, mean of standard deviations and square root of the mean of variances for each simulated sample size n (based on 5000 iterations).

| n | FMM: Mean | FMM: sd | FMM: $\hat{\sigma}$ | FMM: $\overline{s_n^2}$ | FMM: $\overline{s_n}$ | FMM: $\sqrt{\overline{s_n^2}}$ |
|---|---|---|---|---|---|---|
| 20 | 0.00186 | 0.09867 | 0.44126 | 0.2020 | 0.3431 | 0.4494 |
| 30 | 0.00236 | 0.07750 | 0.42446 | 0.1796 | 0.3497 | 0.4238 |
| 40 | 0.00654 | 0.06791 | 0.42952 | 0.1797 | 0.3559 | 0.4240 |
| 50 | 0.00518 | 0.05921 | 0.41868 | 0.1734 | 0.3593 | 0.4164 |
| 80 | 0.00558 | 0.04772 | 0.42678 | 0.1727 | 0.3675 | 0.4156 |
| 100 | 0.00626 | 0.04149 | 0.41488 | 0.1725 | 0.3700 | 0.4154 |
| 200 | 0.00586 | 0.02899 | 0.40995 | 0.1742 | 0.3825 | 0.4173 |
| 500 | 0.00555 | 0.01853 | 0.41428 | 0.1715 | 0.3912 | 0.4141 |

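The kind of resampling summarized in Table 5 can be illustrated with a short sketch. It assumes that each simulated sample is drawn from the fitted mixture of Table 2 by first choosing a component with probability π_i and then drawing from the corresponding normal distribution; the exact procedure used by the authors is not given in this section, and the function names, the seed and the reduced number of iterations (2000 rather than 5000) are choices made here for illustration.

```python
import random
import statistics

# (mu, sigma, weight) for each component, copied from Table 2.
COMPONENTS = [
    (-7.78135, 10.22195, 0.00025),
    (-0.01837, 0.26977, 0.18361),
    (-0.08378, 0.05688, 0.08837),
    (0.06209, 0.51793, 0.16441),
    (-0.02414, 0.13835, 0.52425),
    (0.32596, 0.94185, 0.03558),
    (1.19120, 2.59239, 0.00353),
]


def mixture_moments(components):
    """Analytic mean and variance of a normal mixture."""
    mean = sum(w * mu for mu, s, w in components)
    second = sum(w * (s * s + mu * mu) for mu, s, w in components)
    return mean, second - mean * mean


def draw_sample(n, rng):
    """Draw n values: pick a component by weight, then sample its normal."""
    weights = [w for _, _, w in COMPONENTS]
    picks = rng.choices(COMPONENTS, weights=weights, k=n)
    return [rng.gauss(mu, s) for mu, s, _ in picks]


def simulate(n, iterations, seed=0):
    """Return the sample means and sample variances of repeated draws."""
    rng = random.Random(seed)
    means, variances = [], []
    for _ in range(iterations):
        sample = draw_sample(n, rng)
        means.append(statistics.fmean(sample))
        variances.append(statistics.variance(sample))
    return means, variances


pop_mean, pop_var = mixture_moments(COMPONENTS)
means, variances = simulate(n=50, iterations=2000)
print(f"mixture mean {pop_mean:.5f}, mixture sd {pop_var ** 0.5:.5f}")
print(f"mean of sample means     {statistics.fmean(means):.5f}")
print(f"sd of sample means       {statistics.stdev(means):.5f}")
print(f"mean of sample variances {statistics.fmean(variances):.4f}")
```

Because the first component has a very large standard deviation and a very small weight, the sample variances are strongly right-skewed, so their mean converges slowly and differs from run to run.
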
Table 6. Empirical quantiles in the Finite Mixture Model distribution of means for each sample size (n) (based on 5000 simulations).

| n | 0.01 | 0.025 | 0.05 | 0.1 | 0.9 | 0.95 | 0.975 | 0.99 |
|---|---|---|---|---|---|---|---|---|
| 20 | −0.2085 | −0.1410 | −0.1170 | −0.0885 | 0.1040 | 0.1410 | 0.1780 | 0.2240 |
| 30 | −0.1673 | −0.1280 | −0.1017 | −0.0763 | 0.0870 | 0.1170 | 0.1464 | 0.1950 |
| 40 | −0.1445 | −0.1035 | −0.0832 | −0.0650 | 0.0838 | 0.1095 | 0.1370 | 0.1760 |
| 50 | −0.1348 | −0.0946 | −0.0752 | −0.0570 | 0.0724 | 0.0962 | 0.1192 | 0.1510 |
| 80 | −0.1068 | −0.0789 | −0.0630 | −0.0450 | 0.0609 | 0.0814 | 0.0980 | 0.1174 |
| 100 | −0.0988 | −0.0672 | −0.0521 | −0.0400 | 0.0554 | 0.0692 | 0.0839 | 0.1033 |
| 200 | −0.0764 | −0.0501 | −0.0393 | −0.0279 | 0.0403 | 0.0500 | 0.0596 | 0.0729 |
| 500 | −0.0449 | −0.0326 | −0.0249 | −0.0172 | 0.0282 | 0.0348 | 0.0416 | 0.0477 |

Table 7. Empirical quantiles in the Finite Mixture Model distribution of variances for each sample size (n) (based on 5000 simulations).

| n | 0.01 | 0.025 | 0.05 | 0.1 | 0.9 | 0.95 | 0.975 | 0.99 |
|---|---|---|---|---|---|---|---|---|
| 20 | 0.0183 | 0.0226 | 0.0274 | 0.0341 | 0.2490 | 0.3553 | 0.5785 | 1.2729 |
| 30 | 0.0234 | 0.0288 | 0.0344 | 0.0430 | 0.2424 | 0.3450 | 0.6015 | 1.2833 |
| 40 | 0.0271 | 0.0332 | 0.0393 | 0.0483 | 0.2373 | 0.3796 | 0.6489 | 1.2209 |
| 50 | 0.0332 | 0.0393 | 0.0438 | 0.0527 | 0.2351 | 0.3466 | 0.5653 | 1.2717 |
| 80 | 0.0430 | 0.0488 | 0.0558 | 0.0639 | 0.2354 | 0.3555 | 0.5391 | 0.9611 |
| 100 | 0.0470 | 0.0532 | 0.0593 | 0.0671 | 0.2302 | 0.3282 | 0.5038 | 1.2471 |
| 200 | 0.0624 | 0.0678 | 0.0736 | 0.0811 | 0.2371 | 0.3358 | 0.5814 | 1.5181 |
| 500 | 0.0783 | 0.0829 | 0.0879 | 0.0940 | 0.2250 | 0.3483 | 0.6613 | 1.1634 |

Table 8. NMAS test: proportion of times where the null hypothesis is rejected for several metric tolerances (α = 0.05) when using the Finite Mixture Model (FMM) and the One Normal Distribution Model (1NDM) (based on 5000 simulations).

| n | Tol [m] | FMM | 1NDM | Tol [m] | FMM | 1NDM |
|---|---|---|---|---|---|---|
| 20 | 0.01 | 0.4268 | 0.4268 | 0.20 | 0.0648 | 0.0000 |
| 30 | 0.01 | 0.2664 | 0.2664 | 0.20 | 0.0630 | 0.0000 |
| 40 | 0.01 | 0.1734 | 0.1734 | 0.20 | 0.0509 | 0.0000 |
| 50 | 0.01 | 0.1248 | 0.1248 | 0.20 | 0.0741 | 0.0000 |
| 80 | 0.01 | 0.1368 | 0.0302 | 0.20 | 0.0685 | 0.0000 |
| 100 | 0.01 | 0.0720 | 0.0098 | 0.20 | 0.0731 | 0.0000 |
| 200 | 0.01 | 0.0658 | 0.0022 | 0.20 | 0.0564 | 0.0000 |
| 500 | 0.01 | 0.0640 | 0.0000 | 0.20 | 0.0551 | 0.0000 |
| 20 | 0.05 | 0.0495 | 0.0083 | 0.50 | 0.1092 | 0.0003 |
| 30 | 0.05 | 0.1007 | 0.0074 | 0.50 | 0.0552 | 0.0000 |
| 40 | 0.05 | 0.0513 | 0.0013 | 0.50 | 0.0774 | 0.0000 |
| 50 | 0.05 | 0.0748 | 0.0003 | 0.50 | 0.0894 | 0.0000 |
| 80 | 0.05 | 0.0691 | 0.0001 | 0.50 | 0.0621 | 0.0000 |
| 100 | 0.05 | 0.0816 | 0.0000 | 0.50 | 0.0826 | 0.0000 |
| 200 | 0.05 | 0.0663 | 0.0000 | 0.50 | 0.0487 | 0.0000 |
| 500 | 0.05 | 0.0445 | 0.0000 | 0.50 | 0.0607 | 0.0000 |
| 20 | 0.10 | 0.0805 | 0.0004 | 1.00 | 0.1092 | 0.0003 |
| 30 | 0.10 | 0.0789 | 0.0002 | 1.00 | 0.1518 | 0.1518 |
| 40 | 0.10 | 0.0575 | 0.0000 | 1.00 | 0.0641 | 0.2356 |
| 50 | 0.10 | 0.0749 | 0.0000 | 1.00 | 0.1115 | 0.1115 |
| 80 | 0.10 | 0.0770 | 0.0000 | 1.00 | 0.1131 | 0.2840 |
| 100 | 0.10 | 0.0659 | 0.0000 | 1.00 | 0.0821 | 0.1997 |
| 200 | 0.10 | 0.0510 | 0.0000 | 1.00 | 0.0927 | 0.1740 |
| 500 | 0.10 | 0.0502 | 0.0000 | 1.00 | 0.0795 | 0.3716 |

Table 9. EMAS test: proportion of times where the null hypothesis is rejected (mean case) when using the FMM and the 1NDM (based on 5000 simulations).

| Model | n | α = 0.01 | α = 0.05 | α = 0.1 | α = 0.9 | α = 0.95 | α = 0.99 |
|---|---|---|---|---|---|---|---|
| FMM | 20 | 0.0074 | 0.0557 | 0.1120 | 0.0984 | 0.0505 | 0.0131 |
| FMM | 30 | 0.0093 | 0.0487 | 0.1041 | 0.0956 | 0.0487 | 0.0093 |
| FMM | 40 | 0.0108 | 0.0610 | 0.1152 | 0.0783 | 0.0397 | 0.0066 |
| FMM | 50 | 0.0103 | 0.0618 | 0.1207 | 0.0846 | 0.0423 | 0.0086 |
| FMM | 80 | 0.0109 | 0.0575 | 0.1262 | 0.0777 | 0.0347 | 0.0072 |
| FMM | 100 | 0.0111 | 0.0739 | 0.1294 | 0.0737 | 0.0395 | 0.0070 |
| FMM | 200 | 0.0113 | 0.0696 | 0.1387 | 0.0726 | 0.0356 | 0.0064 |
| FMM | 500 | 0.0152 | 0.0762 | 0.1533 | 0.0548 | 0.0236 | 0.0043 |
| 1NDM | 20 | 0.0167 | 0.0764 | 0.1368 | 0.0745 | 0.0265 | 0.0015 |
| 1NDM | 30 | 0.0173 | 0.0717 | 0.1305 | 0.0778 | 0.0277 | 0.0023 |
| 1NDM | 40 | 0.0177 | 0.0706 | 0.1309 | 0.0807 | 0.0297 | 0.0028 |
| 1NDM | 50 | 0.0175 | 0.0669 | 0.1260 | 0.0794 | 0.0324 | 0.0030 |
| 1NDM | 80 | 0.0160 | 0.0652 | 0.1192 | 0.0862 | 0.0358 | 0.0039 |
| 1NDM | 100 | 0.0156 | 0.0638 | 0.1150 | 0.0874 | 0.0361 | 0.0035 |
| 1NDM | 200 | 0.0130 | 0.0579 | 0.1076 | 0.0960 | 0.0397 | 0.0058 |
| 1NDM | 500 | 0.0112 | 0.0489 | 0.0987 | 0.1022 | 0.0475 | 0.0072 |

Table 10. EMAS test: proportion of times where the null hypothesis is rejected (variance case) when using the FMM and the 1NDM (based on 5000 simulations).

| Model | n | α = 0.01 | α = 0.05 | α = 0.1 | α = 0.9 | α = 0.95 | α = 0.99 |
|---|---|---|---|---|---|---|---|
| FMM | 20 | 0.0136 | 0.0553 | 0.1005 | 0.0980 | 0.0543 | 0.0113 |
| FMM | 30 | 0.0104 | 0.0484 | 0.1033 | 0.0976 | 0.0533 | 0.0091 |
| FMM | 40 | 0.0074 | 0.0445 | 0.0959 | 0.0983 | 0.0466 | 0.0087 |
| FMM | 50 | 0.0105 | 0.0437 | 0.0914 | 0.0988 | 0.0499 | 0.0065 |
| FMM | 80 | 0.0099 | 0.0527 | 0.1001 | 0.0987 | 0.0535 | 0.0095 |
| FMM | 100 | 0.0096 | 0.0477 | 0.0935 | 0.1035 | 0.0599 | 0.0070 |
| FMM | 200 | 0.0116 | 0.0524 | 0.1028 | 0.1044 | 0.0498 | 0.0075 |
| FMM | 500 | 0.0095 | 0.0450 | 0.0915 | 0.0989 | 0.0427 | 0.0095 |
| 1NDM | 20 | 0.3942 | 0.3941 | 0.3901 | 0.3896 | 0.4603 | 0.6090 |
| 1NDM | 30 | 0.4423 | 0.4448 | 0.4434 | 0.3590 | 0.4275 | 0.5531 |
| 1NDM | 40 | 0.4772 | 0.4809 | 0.4800 | 0.3376 | 0.3956 | 0.5154 |
| 1NDM | 50 | 0.5076 | 0.5084 | 0.5055 | 0.3213 | 0.3741 | 0.4905 |
| 1NDM | 80 | 0.5639 | 0.5631 | 0.5676 | 0.2926 | 0.3404 | 0.4359 |
| 1NDM | 100 | 0.5875 | 0.5908 | 0.5885 | 0.2853 | 0.3219 | 0.4081 |
| 1NDM | 200 | 0.6445 | 0.6453 | 0.6432 | 0.2656 | 0.2968 | 0.3523 |
| 1NDM | 500 | 0.6747 | 0.6713 | 0.6733 | 0.2682 | 0.2871 | 0.3246 |

Table 11. EMAS test: proportion of global rejections when using the FMM and the 1NDM with (*) and without (**) applying the Bonferroni correction (based on 5000 simulations).

| Model | n | α = 0.05 (*) | α = 0.05 (**) | α = 0.10 (*) | α = 0.10 (**) |
|---|---|---|---|---|---|
| FMM | 20 | 0.0829 | 0.0378 | 0.1587 | 0.0844 |
| FMM | 30 | 0.0772 | 0.0325 | 0.1547 | 0.0760 |
| FMM | 40 | 0.0727 | 0.0329 | 0.1609 | 0.0727 |
| FMM | 50 | 0.0806 | 0.0387 | 0.1653 | 0.0796 |
| FMM | 80 | 0.0781 | 0.0370 | 0.1533 | 0.0769 |
| FMM | 100 | 0.0916 | 0.0359 | 0.1786 | 0.0911 |
| FMM | 200 | 0.0791 | 0.0318 | 0.1691 | 0.0800 |
| FMM | 500 | 0.0723 | 0.0328 | 0.1596 | 0.0709 |
| 1NDM | 20 | 0.1307 | 0.0928 | 0.1965 | 0.1305 |
| 1NDM | 30 | 0.1366 | 0.1013 | 0.2011 | 0.1355 |
| 1NDM | 40 | 0.1399 | 0.1105 | 0.2050 | 0.1406 |
| 1NDM | 50 | 0.1481 | 0.1143 | 0.2090 | 0.1468 |
| 1NDM | 80 | 0.1598 | 0.1236 | 0.2153 | 0.1572 |
| 1NDM | 100 | 0.1641 | 0.1303 | 0.2202 | 0.1613 |
| 1NDM | 200 | 0.1822 | 0.1531 | 0.2350 | 0.1832 |
| 1NDM | 500 | 0.1973 | 0.1699 | 0.2466 | 0.1983 |

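To see why a correction for multiple testing matters for a global decision such as the one in Table 11, the sketch below computes the nominal probability of at least one false rejection. It assumes that the global EMAS decision rejects whenever either of two tests (mean and variance) rejects, and that the two tests are independent under the null hypothesis; both are simplifying assumptions made for illustration, and the function name is hypothetical.

```python
def familywise_rate(alpha, n_tests, bonferroni):
    """Probability of at least one rejection among independent tests under H0.

    With the Bonferroni correction each test is run at level alpha / n_tests.
    """
    per_test = alpha / n_tests if bonferroni else alpha
    return 1.0 - (1.0 - per_test) ** n_tests


for alpha in (0.05, 0.10):
    plain = familywise_rate(alpha, n_tests=2, bonferroni=False)
    corrected = familywise_rate(alpha, n_tests=2, bonferroni=True)
    print(f"alpha={alpha:.2f}: uncorrected {plain:.4f}, Bonferroni {corrected:.4f}")
```

For α = 0.05 and two tests this gives 0.0975 without the correction and about 0.0494 with it, so the corrected procedure keeps the global level just below the nominal α.
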
Table 12. Mean of the distribution of the 97.5% quantile and its typified value for the Finite Mixture Model (FMM), the One Normal Distribution Model (1NDM) and the empirical distribution of observed data (EDOD) (based on 5000 simulations).

| n | EDOD $\mu_{Q_{0.975}}$ | EDOD $Z_{Q_{0.975}}$ | FMM $\mu_{Q_{0.975}}$ | FMM $Z_{Q_{0.975}}$ | 1NDM $\mu_{Q_{0.975}}$ | 1NDM $Z_{Q_{0.975}}$ |
|---|---|---|---|---|---|---|
| 20 | 0.6627 | 1.5825 | 0.6564 | 1.5676 | 0.6936 | 1.6566 |
| 30 | 0.6940 | 1.6575 | 0.6935 | 1.6564 | 0.7249 | 1.7314 |
| 40 | 0.6901 | 1.6483 | 0.6893 | 1.6461 | 0.7348 | 1.7551 |
| 50 | 0.7155 | 1.7088 | 0.7179 | 1.7146 | 0.7499 | 1.7912 |
| 80 | 0.7405 | 1.7686 | 0.7418 | 1.7717 | 0.7740 | 1.8487 |
| 100 | 0.7586 | 1.8119 | 0.7659 | 1.8294 | 0.7873 | 1.8805 |
| 200 | 0.7781 | 1.8585 | 0.7874 | 1.8807 | 0.8010 | 1.9131 |
| 500 | 0.8009 | 1.9129 | 0.8062 | 1.9257 | 0.8121 | 1.9397 |

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
