1. Introduction
In recent years, crowdsourced geospatial platforms have evolved into complex socio-technical systems, in which geospatial data are continuously produced, modified, and validated through decentralized human–technology interactions. Within such systems, data quality and reliability do not result from centralized control mechanisms but rather emerge from the collective behavior of contributors over time. Ensuring the trustworthiness of data generated in these open, volunteered, and participatory systems has thus become a critical challenge for data-driven decision-making processes, including urban planning, infrastructure management, and public services.
The integration of Web 2.0 technologies with geospatial tools has fundamentally transformed the generation, modification, and utilization of geographic information. This technological convergence enables users to actively contribute spatial data through participatory mapping platforms, resulting in dynamic and continuously evolving datasets [1,2,3]. In this context, geographic information is no longer produced through a linear data collection process but through an interactive system in which individual contributions collectively shape the structure and quality of the dataset. A pivotal development in this domain is the concept of Volunteered Geographic Information (VGI), introduced by Goodchild [4], which represents a shift from traditional top-down data production toward decentralized, user-driven geospatial systems.
The significance of VGI lies in its ability to overcome the limitations of traditional geospatial data collection methods, which are often costly and time-consuming. By leveraging the power of crowdsourcing, VGI offers a scalable alternative that provides timely and diverse geographic information across various domains, including disaster management, transportation planning, and urban development [4,5,6,7]. This shift has effectively transformed passive data consumers into active contributors, thereby democratizing the production of geographic information.
However, the characteristics that make VGI systems open and scalable also introduce significant challenges related to data quality and reliability. The diversity of contributor backgrounds, the absence of formal training, and the lack of enforced standards transform data quality into a system-level property that varies across space and time. Unlike authoritative datasets governed by centralized validation procedures, crowdsourced geospatial systems rely on decentralized and heterogeneous contributor behavior, resulting in varying levels of consistency, accuracy, and temporal relevance [7,8,9,10]. These characteristics make conventional quality assurance mechanisms insufficient for such systems and necessitate alternative approaches that can account for their intrinsic dynamics.
To address these quality concerns, researchers have developed two primary assessment approaches: extrinsic and intrinsic. Extrinsic evaluation involves comparing VGI datasets with authoritative reference data to identify discrepancies and assess positional accuracy [11,12]. However, this approach is fundamentally limited by the scarcity of high-quality reference datasets with global coverage [13,14]. In response to these limitations, intrinsic assessment methods have gained prominence. These techniques evaluate data quality based on internal characteristics such as contributor reputation, edit history, and spatiotemporal consistency, without requiring external benchmarks [15,16,17,18,19]. While intrinsic methods offer a more scalable solution, existing research has predominantly focused on linear features [12,20,21], leaving a significant gap in the assessment of polygonal VGI data—despite their critical importance in applications such as 3D building reconstruction [22,23], urban studies [24], and disaster management [25]. Furthermore, current intrinsic assessment methodologies typically rely on expert-driven weighting of quality indicators through multi-criteria decision-making processes, which are inherently subjective and time-consuming [26].
Among various crowdsourced geospatial platforms, OpenStreetMap (OSM) has become a prominent example due to its global coverage, continuous updates, and large volunteer community. The platform's expansion has enabled the rapid collection of vast amounts of geospatial data, yet the quality and reliability of these data often remain uncertain because of the decentralized and heterogeneous nature of the underlying contribution processes; this uncertainty limits broader adoption in analytical and operational contexts. Without systematic assessment and labeling mechanisms, using such data for scientific research, urban planning, or crisis management may lead to unreliable or biased outcomes. Moreover, most existing studies evaluate data quality at aggregated spatial or thematic levels, overlooking the fact that reliability in participatory geospatial platforms emerges from the interaction of individual contributors, editing histories, and feature-level characteristics. As a result, there is a growing need for system-level approaches capable of modeling reliability as a dynamic property of the data production process, rather than as a static attribute of the final dataset.
In response to these challenges, this study proposes a system-oriented framework for estimating the reliability of crowdsourced building polygons using intrinsic quality indicators and unsupervised machine learning techniques. By analyzing contributor behavior, temporal metadata, and geometric evolution extracted from the OpenStreetMap history, the proposed framework models reliability as an emergent property of the underlying participatory system. The framework eliminates reliance on authoritative reference datasets and expert-based weighting, thereby enhancing objectivity, scalability, and reproducibility. The main contributions of this research are (1) the development of intrinsic reliability indicators tailored to polygonal geospatial data; (2) the introduction of an unsupervised, data-driven approach for criterion weighting; (3) feature-level reliability classification within a large-scale crowdsourced system; and (4) the provision of a transferable methodological foundation for trust-aware analysis in open geospatial systems.
Our findings have significant implications for both academic research and practical applications, offering urban planners, emergency responders, and GIS professionals robust tools to assess the suitability of VGI data for critical decision-making processes. Unlike previous studies, which depended heavily on reference data or expert evaluations, this research utilizes unsupervised learning and data mining techniques to deliver a scalable, data-driven solution for trust assessment in VGI environments. Ultimately, this work contributes to the broader objective of establishing VGI as a reliable complement to authoritative geospatial data sources.
This paper is organized as follows. The remainder of this section provides an introductory assessment of intrinsic quality. Section 2 provides a detailed discussion of the materials and the proposed approach. Section 3 examines and evaluates the outcomes derived from implementing the proposed method. Finally, Section 4 and Section 5 present the discussion and conclusion, respectively.
Intrinsic Assessment of Quality
Intrinsic assessment of quality refers to the evaluation of data trustworthiness based on the internal characteristics of the dataset itself, without relying on external reference datasets. This approach is particularly valuable in the context of VGI, where authoritative ground-truth data may be unavailable, inconsistent, or impractical to obtain at scale. One of the key concepts employed in intrinsic assessment is reliability. The concept of reliability in geospatial data was first introduced by Azouzi [27], who emphasized the role of user trust in assessing the credibility of geospatial information. Reliability represents the degree of confidence or trust that users can place in a dataset. It relates directly to data quality: more dependable data are generally associated with higher quality. A dataset that consistently produces accurate and valid results can be relied upon for analysis, decision-making, and other purposes, and is therefore considered reliable [28].
Quality and reliability are two closely related concepts in the context of data quality analysis. Given the importance of geospatial data sharing and the increasing use of crowdsourced geospatial data in geospatial information systems, evaluating the reliability of geospatial data has become a significant area of research [29]. Another motivation for investigating the reliability of crowdsourced geospatial information is its potential feasibility as a substitute for established geospatial data sources. Data reliability is influenced by the volunteer nature of contributions, the concept of developing crowdsourced geographic information infrastructures based on open data input, and the characteristics of contributors [30,31].
This potential is reflected in the increasing use of VGI by established mapping organizations. VGI is a substantial and frequently updated source of publicly available geographic data, making it highly valuable. However, the reliability of VGI depends on the quality of data contributed by individuals with varying levels of expertise. Despite this, VGI offers organizations significant cost-saving opportunities by reducing the need for expensive data collection methods and helping to improve the accuracy of maps and geographic information. Thus, the advantages of using VGI must be balanced against potential risks to data validity and quality.
To ensure integration and consistency in assessing spatial accuracy, attribute accuracy, logical consistency, completeness, data generation processes, and metadata, a comprehensive set of geospatial data quality control standards has been established [32]. The most recent version of these standards is ISO 19157:2013 [33].
One of the primary challenges in applying geographic information quality standards to crowdsourced data arises from the diverse processes involved. This issue poses a significant barrier to extending the application of spatial information quality standards to the domain of crowdsourced geospatial data. However, certain quality criteria for crowdsourced geospatial data—such as positional accuracy, completeness, and logical consistency—have been implemented based on related concepts from the aforementioned standards. Teimoory et al. [28] defined the concept of reliability using the data history file and quantitative measures, including the number of versions, the number of contributors, temporal changes, and the number of tag edits. To evaluate reliability, the results of the proposed model were compared with existing official data from the study area, and the role of each criterion in data conformity and reliability was examined.
Intrinsic evaluation approaches focus on documented occurrences throughout the data lifecycle, fundamental data attributes, and the development of estimation models to predict quantifiable metrics. These methodologies address the problem of data quality [8,13,16,28,34,35,36,37,38,39].
Reliability in volunteered geographic information is defined as the degree of confidence that users have in the information associated with a given feature. Haklay et al. [39] emphasized the importance of reliability in the use of VGI, which has contributed to its increasing adoption.
In this study, after identifying and evaluating relevant criteria from the literature review, six criteria were extracted from the data source to assess the reliability of polygon data: the number of contributors, the number of versions, the creation date, the last edit date, the number of tag edits, and cumulative area change. Each of these criteria is described below.
Number of Contributors: This metric reflects the total number of distinct users who have contributed to creating or editing a feature, such as a building, in OSM. According to the “many-eyes” principle, the reliability of a feature increases with the number of contributors involved in its editing. In other words, the more users who validate a feature, the higher its reliability [40]. Therefore, this criterion is regarded as a positive indicator.
Number of Versions: The number of versions refers to how many times a feature—such as a building, road, or park—has been edited, updated, or corrected by users over time. A higher number of versions indicates that the feature has undergone more revisions, suggesting ongoing quality improvement. Therefore, a greater number of versions in the history file implies enhanced quality and increased reliability of the feature [16]. This criterion is considered a positive factor.
Creation Date: In geospatial data, the creation date of a feature refers to the date when a volunteer user first created that feature. In the OSM Full History file, each feature—such as a building—has a temporal series of versions, with the first record in this series representing the creation date of the feature. The earlier the creation date, the greater the likelihood of subsequent edits. Since the recorded value represents the time difference (in days) between a reference date (1 January 1970) and the feature’s creation date, a lower value indicates an earlier creation. Therefore, this is considered a negative criterion.
Last Edit Date: This refers to the most recent modification made to a feature, which may include changes to its geometry (such as location or shape) or its descriptive attributes (tags). The more recent the last edit date of a feature, the higher its reliability [16]. It is important to note that the value obtained for this criterion represents the number of days between the reference date and the feature’s last edit date. A higher value indicates more up-to-date data. Therefore, this criterion is considered a positive indicator.
Number of Tag Edits: This refers to the number of times the descriptive tags of a feature have been modified throughout its history. The frequency of edits and corrections made to a feature’s tags (e.g., a building’s land use tag) indicates a degree of ambiguity [28,40]. A higher number of tag edits corresponds to lower reliability of the feature. Therefore, this criterion is considered a negative indicator.
Cumulative Area Change: The cumulative area change is a metric used to quantify the extent of geometric changes in the area of a feature (such as a building) over time and across its different versions. It is calculated for each version relative to the previous one, and the cumulative sum is recorded for the corresponding polygon ID in the latest version. This metric indicates how strongly the area has increased or decreased over the feature's history (see Equation (1)). Essentially, the trend of polygon area changes in the history file is analyzed by calculating the area change indicator. This metric is considered a positive criterion.
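As a rough illustration of the last criterion, the value can be computed from the successive versions of a polygon. The sketch below (plain Python, shoelace areas in projected metres, illustrative coordinates) assumes Equation (1) sums the absolute area differences between consecutive versions; the history shown is hypothetical.

```python
def shoelace_area(coords):
    """Planar polygon area via the shoelace formula (coords in metres)."""
    n = len(coords)
    s = 0.0
    for k in range(n):
        x1, y1 = coords[k]
        x2, y2 = coords[(k + 1) % n]
        s += x1 * y2 - x2 * y1
    return abs(s) / 2.0

def cumulative_area_change(version_coords):
    """Sum of |area(v) - area(v-1)| over consecutive versions of a polygon."""
    areas = [shoelace_area(c) for c in version_coords]
    return sum(abs(a2 - a1) for a1, a2 in zip(areas, areas[1:]))

# Hypothetical edit history of one building footprint (UTM metres):
history = [
    [(0, 0), (10, 0), (10, 10), (0, 10)],   # v1: 100 m^2
    [(0, 0), (12, 0), (12, 10), (0, 10)],   # v2: 120 m^2
    [(0, 0), (12, 0), (12, 11), (0, 11)],   # v3: 132 m^2
]
print(cumulative_area_change(history))  # 20 + 12 = 32.0
```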
2. Materials and Methods
This research introduces a novel approach for evaluating the reliability of volunteered geospatial polygon data without relying on reference data. Instead, the method utilizes intrinsic criteria—fundamental characteristics inherent within the data itself—to assess reliability systematically. By analyzing these criteria, the study derives key insights into data quality and employs machine learning to identify the criteria that most significantly influence reliability.
The approach integrates two primary sources of OpenStreetMap (OSM) data: historical records and the most recent OSM dataset, ensuring a comprehensive assessment. Subsequently, the study evaluates the effectiveness of the proposed method, with a specific focus on its performance in measuring spatial accuracy.
Figure 1 outlines the overall approach, and the subsequent sections provide a detailed exploration of each step in the process. This structured methodology not only improves the understanding of the reliability of volunteered geospatial data but also offers a scalable tool for future quality assessments.
2.1. Data Collection
The required data were collected from two sources: the OSM history data and the latest OSM version. Both files were downloaded from the official OpenStreetMap repository (https://planet.openstreetmap.org (accessed on 12 February 2025)); the downloads are 214 GB and 75 GB, respectively, in ZIP format. The history file contains the complete edit history of all features (points, lines, and polygons) and documents every user contribution. This comprehensive record enables detailed temporal and structural analysis. To prepare the building polygon data for the study area, the relevant subsets were extracted from the comprehensive history file and from the latest file and converted to the XML (Extensible Markup Language) format.
XML is a hierarchical, tag-based format. In this structure, each geographic feature—such as a node (point), way (line or polygon), or relation—is represented as a discrete XML element, accompanied by metadata including the version number, timestamp, user ID, and associated tags.
Figure 2 shows an example of an OSM data history file in XML format, corresponding to a polygon feature.
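A minimal sketch of how such a history fragment can be read with the standard library follows; the XML excerpt is hypothetical but follows the element and attribute layout described above (two versions of one way, with version, timestamp, uid, and tag metadata).

```python
import xml.etree.ElementTree as ET

# Hypothetical excerpt of an OSM history file: two versions of the same way.
osm_xml = """<osm version="0.6">
  <way id="123" version="1" timestamp="2019-03-01T10:00:00Z" uid="42" user="alice">
    <nd ref="1"/><nd ref="2"/><nd ref="3"/><nd ref="1"/>
    <tag k="building" v="yes"/>
  </way>
  <way id="123" version="2" timestamp="2021-07-15T08:30:00Z" uid="77" user="bob">
    <nd ref="1"/><nd ref="2"/><nd ref="3"/><nd ref="1"/>
    <tag k="building" v="residential"/>
  </way>
</osm>"""

root = ET.fromstring(osm_xml)
versions = [
    {
        "version": int(w.get("version")),
        "timestamp": w.get("timestamp"),
        "uid": int(w.get("uid")),
        "tags": {t.get("k"): t.get("v") for t in w.findall("tag")},
    }
    for w in root.findall("way")
    if w.get("id") == "123"
]
# Two of the intrinsic criteria fall out directly:
n_versions = len(versions)
n_contributors = len({v["uid"] for v in versions})
print(n_versions, n_contributors)  # 2 2
```

In practice a streaming parser would be used for the multi-gigabyte planet files, but the element structure read here is the same.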
2.2. Preprocessing and Data Preparation
To ensure spatial consistency across the dataset, the OSM-derived data were projected into a common coordinate system, specifically Universal Transverse Mercator (UTM) Zone 38 on the WGS84 reference ellipsoid. Polygon features of the study area were then extracted from the OSM history file. During preprocessing, geometric inconsistencies—including duplicate polygons, overlapping features, and cross-layer overlaps—were identified and corrected to improve geometric accuracy and data integrity.
2.3. Criteria Scaling
To perform criteria scaling, the data were converted into a standardized and comparable scale. Fuzzy membership function normalization, a commonly used method in fuzzy set theory, was employed. This approach is particularly effective when dealing with uncertainty in the data [41,42,43].
In fuzzy membership normalization, data values are transformed into membership degrees ranging from 0 to 1, indicating the extent to which each data point belongs to a specific fuzzy set. This transformation provides a continuous representation of uncertainty, rather than confining data within rigid categorical boundaries. The commonly used sigmoid function is employed in this research [28]. This method standardizes heterogeneous data, making it more suitable for further analysis while enhancing comparability and decision-making across different attributes. For positive criteria whose influence increases with higher values, Equation (2) is used. For negative criteria whose influence decreases as their values increase, Equation (3) is applied.
In these two equations, $x_{ij}$ represents the feature value of the $i$-th instance for the $j$-th criterion, $u_j$ is the upper threshold for positive criteria, $v_j$ is the upper threshold for negative criteria, and the membership degree values $\mu_{ij}$ range between 0 and 1. For each criterion, two threshold values—upper and lower—are defined. For positive criteria, values equal to or greater than the upper threshold are normalized to one, while values equal to or less than the lower threshold are normalized to zero. Conversely, for negative criteria, values equal to or greater than the upper threshold are normalized to zero, and those equal to or less than the lower threshold are normalized to one. These thresholds are applied to eliminate outliers and to incorporate expert judgment in constraining the data range. For example, from an expert perspective, a feature modified 10 times is conceptually similar to one modified 27 times; thus, values equal to or greater than 10 are normalized to one, and the lowest values, which conceptually represent zero, are normalized to zero. The empirical values for the lower and upper thresholds in Table 1 were defined through expert group consultation to minimize individual bias and enhance the reliability of the decisions, an approach recognized as effective for reducing subjectivity and improving validity [44].
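Since the exact sigmoid forms of Equations (2) and (3) are not reproduced in the text, the following sketch uses a generic sigmoid centred between the two thresholds and clamped to 0/1 outside them; the steepness parameter is an assumed tuning value, not the study's.

```python
import math

def fuzzy_membership(x, lower, upper, positive=True):
    """Map a raw criterion value to a membership degree in [0, 1]."""
    # Clamp outside the expert-defined thresholds, as described above.
    if positive:
        if x >= upper:
            return 1.0
        if x <= lower:
            return 0.0
    else:
        if x >= upper:
            return 0.0
        if x <= lower:
            return 1.0
    # Sigmoid between the thresholds (assumed shape and steepness).
    mid = (lower + upper) / 2.0
    spread = (upper - lower) / 10.0
    mu = 1.0 / (1.0 + math.exp(-(x - mid) / spread))
    return mu if positive else 1.0 - mu

# Positive criterion (e.g. number of versions), thresholds 0 and 10:
print(fuzzy_membership(27, 0, 10))   # 1.0 (capped: 27 edits ~ 10 edits)
print(fuzzy_membership(0, 0, 10))    # 0.0
# Negative criterion (e.g. number of tag edits), same thresholds:
print(fuzzy_membership(27, 0, 10, positive=False))  # 0.0
```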
2.4. Clustering of Building Polygons Based on Reliability-Related Criteria
In the context of geospatial data quality assessment and the automation of this process, the application of machine learning—specifically unsupervised learning methods such as clustering—has shown significant potential for identifying structural patterns and grouping geospatial features based on intrinsic geometric and attribute characteristics. Building polygons can vary considerably in terms of their quality and geometric regularity. Reliability-related criteria, such as the inherent criteria discussed in the Section Intrinsic Assessment of Quality, serve as meaningful descriptors for assessing the quality of these polygons. To effectively group these polygons based on quality, the K-means clustering algorithm, a widely used unsupervised learning method, has been employed [45]. This method partitions an input dataset into K disjoint clusters by minimizing the within-cluster sum of squared errors (SSE) [46]. The algorithm uses distance as a similarity metric, which determines how objects are assigned to the nearest centroid. The most widely used metric in K-means is Euclidean distance [47]. This distance reflects the geometric distance between two data points in an n-dimensional feature space. Equation (4) presents its squared form, which is preferred in K-means due to computational efficiency.
Each building polygon is encoded as a feature vector composed of its reliability-based geometric descriptors. K-means iteratively assigns polygons to the nearest cluster centroid, updating the centroids until convergence. This approach facilitates a data-driven classification of buildings into groups with similar reliability criteria. A critical aspect of implementing K-means is determining the optimal number of clusters k, which greatly influences the clustering results. To address this, the Elbow Method is commonly applied and remains highly effective in practice, particularly for spatial data, where it has successfully identified optimal clusters [48]. This widely used technique evaluates how the sum of squared errors (SSE) varies with increasing values of k. As more clusters are introduced, the SSE generally decreases because polygons are grouped into smaller, more homogeneous clusters, thus reducing intra-cluster variability. This combination of K-means clustering and the Elbow Method enables a robust unsupervised classification framework that can divide building polygons into an optimal number of reliability categories. The quality of a cluster $C_k$ can be measured using intra-cluster variability, calculated as the sum of squared errors (SSE) of the distances between all objects within $C_k$ and its centroid $c_k$. This objective function is formally defined in Equation (5) [46]:

$$E = \sum_{k=1}^{K} \sum_{p \in C_k} \lVert p - c_k \rVert^2$$

In this equation, $E$ represents the sum of squared errors (SSE) for all entities in the dataset, $p$ signifies a point representing a data object, and $c_k$ refers to the centroid of the cluster $C_k$. Both $p$ and $c_k$ are multidimensional; specifically, for each object within each cluster, the squared distance to its cluster centroid is calculated, and these values are aggregated across all objects and clusters. The objective function seeks to maximize the density of each cluster while maintaining its distinctiveness. Mathematically, the K-means algorithm steps are as follows:
- Step 1: Initialization. Randomly initialize $k$ centroids $c_1, c_2, \ldots, c_k$.
- Step 2: Assignment. Assign each data point to the nearest cluster centroid using the squared Euclidean distance, based on Equation (6).
- Step 3: Update. Recalculate the centroid $c_j$ of each cluster $C_j$ as the mean of all points assigned to that cluster, based on Equation (7), where $\lvert C_j \rvert$ is the number of points in cluster $C_j$.
- Step 4: Convergence check. Repeat Steps 2 and 3 until no change occurs in the cluster assignments, based on Equation (8).
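The four steps can be sketched compactly in NumPy; the data below are synthetic stand-ins for the normalized criterion vectors, not values from the study.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain K-means following Steps 1-4 above."""
    rng = np.random.default_rng(seed)
    # Step 1: initialize k centroids from randomly chosen data points.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 2: assign each point to its nearest centroid
        # (squared Euclidean distance, as in Equation (6)).
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of its members.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 4: stop once the centroids (hence assignments) no longer change.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    labels = d2.argmin(axis=1)
    sse = d2.min(axis=1).sum()  # within-cluster SSE, Equation (5)
    return labels, centroids, sse

# Two well-separated synthetic groups in a 6-criteria feature space:
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.2, 0.02, (50, 6)), rng.normal(0.8, 0.02, (50, 6))])
labels, centroids, sse = kmeans(X, k=2)
print(round(float(sse), 3))
```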
2.5. Criteria Importance Assessment
Measuring the importance of criteria is a fundamental step in multi-criteria decision-making (MCDM) processes. Traditionally, expert judgment has been used to determine these measurements, which can sometimes lead to subjectivity, inconsistency, and time consumption. Recent studies indicate that objective weighting techniques and machine learning provide a consistent alternative by automatically computing criteria weights based on data patterns, thereby reducing the need for extensive expert involvement. Odu [49] emphasized that although subjective methods are simple, they are prone to bias, whereas objective, data-driven approaches produce more reliable and reproducible weighted outcomes. Further research by Van Dua et al. [50] demonstrated that combining multiple objective weighting methods with machine learning-based ranking strategies enhances the consistency and stability of decision-making models. These findings confirm that employing machine learning for weighting criteria not only accelerates the decision-making process but also improves transparency, reduces errors associated with human judgment, and ensures reproducibility in complex multi-criteria contexts. In the absence of predefined reliability labels, a machine learning approach was implemented to derive the criteria weights objectively.
In the K-means clustering algorithm, the importance of the criteria can be objectively determined by analyzing the distance between cluster centroids and the global mean of each feature. In this approach, features that cause greater deviations of cluster centroids from the overall mean values are considered more discriminative and, therefore, more influential in the clustering process [51,52]. This approach provides a transparent, data-driven alternative to traditional expert-based weighting, aligning with the findings of Manzali et al. [53], where machine learning methods demonstrated high efficiency in evaluating feature relevance.
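A minimal sketch of this centroid-deviation weighting follows; the normalization to unit sum and the use of the mean absolute deviation are assumptions, and the centroid values are illustrative rather than the study's results.

```python
import numpy as np

def criterion_weights(centroids, global_mean):
    """Weight of criterion j ∝ mean |centroid_j - global_mean_j| over clusters."""
    dev = np.abs(centroids - global_mean).mean(axis=0)  # per-criterion deviation
    return dev / dev.sum()                              # normalize to sum to 1

# 3 clusters x 4 criteria (normalized values); criterion 1 separates the
# clusters strongly, criteria 2 and 4 not at all:
centroids = np.array([
    [0.9, 0.5, 0.2, 0.5],
    [0.5, 0.5, 0.5, 0.5],
    [0.1, 0.5, 0.8, 0.5],
])
global_mean = np.array([0.5, 0.5, 0.5, 0.5])
w = criterion_weights(centroids, global_mean)
print(w)  # criterion 1 gets the largest weight; criteria 2 and 4 get weight 0
```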
2.6. Reliability Class Estimation for Each Cluster
To evaluate the quality class associated with each of the five clusters, a scoring mechanism was developed. After the importance (weight) of each criterion was objectively calculated by analyzing its influence on cluster separation, as discussed in Section 2.5, a weighted reliability score was computed for each polygon within each cluster. This was achieved by Equation (9):

$$R_i = \sum_{j} w_j \, \mu_{ij}$$

where $R_i$ is the reliability score for polygon $i$, $\mu_{ij}$ represents the normalized value of criterion $j$ for polygon $i$, and $w_j$ is the calculated importance (weight) of criterion $j$.
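Under the assumption that Equation (9) is the usual weighted sum of normalized criterion values, the score reduces to a dot product; the weights and membership values below are illustrative only.

```python
import numpy as np

# Assumed weights for the six criteria (sum to 1) and normalized
# membership values for two hypothetical polygons:
weights = np.array([0.30, 0.25, 0.15, 0.10, 0.10, 0.10])
mu = np.array([
    [0.9, 0.8, 0.7, 1.0, 0.6, 0.5],   # polygon 1: consistently high values
    [0.2, 0.1, 0.4, 0.3, 0.2, 0.1],   # polygon 2: consistently low values
])
reliability = mu @ weights  # R_i = sum_j w_j * mu_ij
print(reliability)  # polygon 1 scores markedly higher than polygon 2
```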
2.7. Clustering Quality Assessment
Evaluating the quality of clustering is a critical step in ensuring that the identified groups are both meaningful and reliable. A valid clustering solution should ideally satisfy two conditions: (1) internal consistency, where objects within the same cluster are highly similar to one another and clearly distinct from objects in other clusters, and (2) spatial coherence, where clusters exhibit recognizable and interpretable patterns across the study area. To address these complementary dimensions, clustering quality was assessed using both an intrinsic validity index (the silhouette coefficient) and a spatial autocorrelation measure (Global Moran’s I).
2.7.1. Silhouette Coefficient
The silhouette coefficient is an internal metric that simultaneously evaluates intra-cluster cohesion and inter-cluster separation [54,55,56]. This measure is widely used in clustering evaluation due to its intuitive interpretation and demonstrated reliability across various partitioning algorithms, including K-means. For each polygon $i$, the silhouette value is defined by Equation (10):

$$s(i) = \frac{b(i) - a(i)}{\max\{a(i), b(i)\}}$$

where $a(i)$ is the intra-cluster distance, computed as the average distance between object $i$ and all other objects within the same cluster. This term measures cluster cohesion (Equation (11)):

$$a(i) = \frac{1}{\lvert C_i \rvert - 1} \sum_{j \in C_i,\; j \neq i} d(i, j)$$

in which $C_i$ denotes the set of objects in the same cluster as $i$, and $d(i, j)$ denotes the distance between objects $i$ and $j$. $b(i)$ is the inter-cluster distance, defined as the minimum average distance between object $i$ and all objects in any other cluster $C_k$. This term measures separation (Equation (12)):

$$b(i) = \min_{C_k \neq C_i} \frac{1}{\lvert C_k \rvert} \sum_{j \in C_k} d(i, j)$$

A silhouette coefficient close to +1 indicates well-separated and cohesive clusters, values around 0 suggest that objects lie near boundaries between clusters, and negative values (between 0 and −1) imply possible misclassification. The average silhouette coefficient across all polygons provides an overall measure of clustering validity, serving as a theoretical basis for the practical evaluation in the implementation section.
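For reference, the silhouette computation can be sketched directly from Equations (10)-(12); `sklearn.metrics.silhouette_score` implements the same definition. The points below are synthetic.

```python
import numpy as np

def silhouette_values(X, labels):
    """Per-object silhouette values s(i) from Equations (10)-(12)."""
    D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))  # pairwise distances
    s = np.empty(len(X))
    for i in range(len(X)):
        same = (labels == labels[i])
        # a(i): mean distance to the other members of i's own cluster
        a = D[i, same & (np.arange(len(X)) != i)].mean()
        # b(i): smallest mean distance to any other cluster
        b = min(D[i, labels == c].mean() for c in set(labels) if c != labels[i])
        s[i] = (b - a) / max(a, b)
    return s

# Two tight, well-separated synthetic clusters:
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
              [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])
labels = np.array([0, 0, 0, 1, 1, 1])
sil = silhouette_values(X, labels)
print(round(float(sil.mean()), 3))  # close to 1 for well-separated clusters
```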
2.7.2. Moran’s I
While the silhouette coefficient assesses the structural integrity of clusters in feature space, it does not account for their spatial arrangement. To complement this perspective, the Global Moran’s I statistic [57] was applied to evaluate the degree of spatial autocorrelation among cluster assignments. Moran’s I is defined according to Equations (13) and (14):

$$I = \frac{n}{S_0} \cdot \frac{\sum_{i=1}^{n} \sum_{j=1}^{n} w_{ij} (x_i - \bar{x})(x_j - \bar{x})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}, \qquad S_0 = \sum_{i=1}^{n} \sum_{j=1}^{n} w_{ij}$$

where $n$ is the total number of spatial units (building polygons); $x_i$ and $x_j$ are the attribute values of spatial units $i$ and $j$, while $\bar{x}$ is the mean of the attribute values across all units; $w_{ij}$ is the spatial weight between units $i$ and $j$; and $S_0$ represents the sum of all spatial weights.
The spatial weight matrix was generated using a first-order rook contiguity approach, in which two spatial units were considered neighbors only if they shared a common boundary edge. To ensure consistency across spatial units with differing numbers of neighbors, the spatial weights matrix was row-standardized, ensuring that the total weight distributed among a unit’s neighbors equals one. This measure serves as a global indicator of the overall spatial pattern of a variable. Values approaching +1 indicate strong positive spatial autocorrelation, meaning that similar values tend to cluster near each other. Values close to 0 suggest a random spatial distribution without a discernible pattern, whereas values approaching −1 reflect negative spatial autocorrelation, where dissimilar values are systematically interspersed throughout the space. Thus, this measure provides a spatial perspective that complements the silhouette coefficient and enables the assessment of how clusters ranging from “very low” to “very high” building reliability are distributed across the study area.
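A small sketch of Global Moran's I with row-standardized rook-contiguity weights on a regular grid follows (the study instead derives neighbours from shared polygon boundaries); the two test patterns illustrate the interpretation given above.

```python
import numpy as np

def morans_i(grid):
    """Global Moran's I with row-standardized rook-contiguity weights."""
    n_rows, n_cols = grid.shape
    x = grid.ravel().astype(float)
    n = x.size
    W = np.zeros((n, n))
    for r in range(n_rows):
        for c in range(n_cols):
            i = r * n_cols + c
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):  # rook neighbours
                rr, cc = r + dr, c + dc
                if 0 <= rr < n_rows and 0 <= cc < n_cols:
                    W[i, rr * n_cols + cc] = 1.0
    W /= W.sum(axis=1, keepdims=True)  # row standardization
    z = x - x.mean()
    return (n / W.sum()) * (z @ W @ z) / (z @ z)

checkerboard = np.indices((6, 6)).sum(axis=0) % 2   # every neighbour dissimilar
halves = np.zeros((6, 6)); halves[:, 3:] = 1        # two homogeneous blocks
print(morans_i(checkerboard), morans_i(halves))  # ≈ -1 (dispersed), ≈ 0.815 (clustered)
```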
3. Results
The study area for this research is Tehran, the capital city of Iran (Figure 3). The latest version and complete historical dataset of OSM for the study area were extracted in XML format from the comprehensive latest and historical files, as described in Section 2.1. The data volumes of these two subsets are approximately 818 MB for the latest version and around 123 GB for the historical file.
Before preprocessing, the raw OSM subset datasets included all feature types (points, lines, and polygons), containing a total of 1,137,139 features in the latest version and 3,181,423 features in the full history dataset. Within these two subsets, the number of polygon features (including all land use types) was 77,447 in the latest version and 154,307 in the history file. After filtering for building polygons, 58,550 features were identified in the latest version and 92,170 features in the history dataset. These building polygons were then used as the basis for further analysis and quality assessment within the scope of this study.
3.1. Extraction of the Criteria Value
Based on the description provided in the Section Intrinsic Assessment of Quality, the quantitative values for each criterion—including the number of versions, creation date in days, cumulative area change, number of contributors, last edit date in days, and number of tag edits—were extracted from both the latest version and the historical OSM file.
Table 2 presents example values of each criterion for the building polygons (total number of building polygons = 58,550).
After collecting the data related to the criteria, the values were standardized and transformed into a comparable format by converting them to a common scale ranging from 0 to 1 using a fuzzy membership function, as described in
Section 2.3.
Table 3 presents an example of normalized values for each criterion applied to building polygons.
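A minimal sketch of the sigmoid-type fuzzy membership normalization follows; the `midpoint` and `spread` parameters are illustrative placeholders, not the calibrated values used in the study.

```python
import numpy as np

def fuzzy_sigmoid(x, midpoint, spread):
    """Large-type sigmoid fuzzy membership: maps raw criterion values to
    the (0, 1) range, with 0.5 reached at `midpoint`."""
    x = np.asarray(x, dtype=float)
    return 1.0 / (1.0 + np.exp(-(x - midpoint) / spread))

# Example: normalizing a hypothetical "number of versions" column
versions = np.array([1, 3, 5, 10, 25])
normalized = fuzzy_sigmoid(versions, midpoint=5, spread=2)
```

The same transformation, with criterion-specific parameters, brings all six criteria onto a common 0–1 scale so that they are directly comparable.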
3.2. Clustering of Building Polygons Based on Reliability
In this implementation phase, the K-means clustering algorithm was applied to classify building polygons into distinct groups based on their reliability criteria. K-means operates by minimizing the intra-cluster variance, specifically the sum of squared distances between data points and their corresponding cluster centroids. However, to effectively apply this method, it is essential to determine the optimal number of clusters,
k, which governs the granularity of classification and directly affects the interpretability of the results. As shown in
Figure 4, the sum of squared errors (SSE) exhibits a sharp change in slope at
k = 5. This elbow point suggests that the building polygons can be effectively grouped into five distinct reliability classes: very high, high, moderate, low, and very low. Accordingly, the clustering process was executed with
k = 5, and each polygon was assigned to one of the resulting clusters based on its feature vector.
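The elbow procedure can be sketched as follows. This is a minimal Lloyd’s-algorithm k-means on synthetic feature vectors, not the tuned clustering used in the study; the SSE-versus-k scan mirrors the kind of curve shown in Figure 4.

```python
import numpy as np

def kmeans_sse(X, k, iters=50, seed=0):
    """Minimal Lloyd's k-means: returns labels and the sum of squared
    errors (SSE) to the assigned centroids. An illustrative sketch only."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(iters):
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):  # keep the old center if a cluster empties
                centers[j] = X[labels == j].mean(axis=0)
    sse = ((X - centers[labels]) ** 2).sum()
    return labels, sse

# Elbow scan: synthetic 6-dimensional "criterion" vectors with 5 true groups
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=m, scale=0.1, size=(40, 6)) for m in range(5)])
sse_by_k = {k: kmeans_sse(X, k)[1] for k in range(2, 9)}
# SSE drops steeply up to the true number of groups, then flattens (the "elbow")
```

In the study, the elbow at k = 5 in Figure 4 motivated the choice of five reliability classes.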
After determining the optimal number of clusters (k), the K-means algorithm was applied to partition the building polygons into distinct groups, assigning each polygon to one of the k clusters based on its similarity across the selected features. Because the clustering process was guided by intrinsic criteria reflecting the reliability of VGI, the resulting clusters can be interpreted as representing different classes of data reliability.
To illustrate the outcome of the clustering process,
Table 4 provides a sample showing building polygons alongside their corresponding cluster labels.
3.3. Quantitative Class of Reliability for Each Cluster
To assess the qualitative class corresponding to each of the five identified clusters, a systematic scoring framework was implemented, as previously detailed in
Section 2.6. First, the relative importance of each criterion was obtained as shown in
Table 5.
The average reliability for each cluster was then calculated to rank the five clusters in ascending order. Clusters with higher mean scores were interpreted as representing building polygons with greater reliability, while those with lower scores indicated less reliability. The five clusters (
Table 6) were subsequently classified into five qualitative categories: very high, high, moderate, low, and very low.
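The scoring-and-ranking step can be sketched as below. The per-polygon scores, the equal weights, and the cluster labels are synthetic placeholders (the study’s criterion weights come from Table 5); only the ranking logic mirrors the framework.

```python
import numpy as np

# Synthetic stand-ins: normalized criterion values (rows: polygons,
# cols: 6 criteria), placeholder weights, and k-means cluster labels.
rng = np.random.default_rng(0)
scores = rng.random((12, 6))
weights = np.full(6, 1 / 6)  # placeholder: equal weights
labels = np.array([0, 0, 1, 1, 2, 2, 3, 3, 4, 4, 0, 1])

reliability = scores @ weights  # per-polygon weighted reliability score
cluster_means = {c: reliability[labels == c].mean() for c in np.unique(labels)}

# Rank clusters by mean reliability and map ranks to qualitative classes
classes = ["very low", "low", "moderate", "high", "very high"]
order = sorted(cluster_means, key=cluster_means.get)
cluster_class = {c: classes[rank] for rank, c in enumerate(order)}
```

The cluster with the highest mean score is labeled “very high” and the lowest “very low”, exactly as the ascending ranking in the text prescribes.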
Figure 5 shows the distribution of these clusters across the case study area. This classification provides an interpretable and quantitative basis for the overall and automated evaluation of building polygons within each cluster.
3.4. Correlation Matrix Between Criteria
To assess the independence and interrelationships among the criteria, pairwise correlations were calculated using the Pearson method.
Figure 6 presents a heatmap of the Pearson correlation matrix, illustrating the degree of association between the variables.
The interpretation of correlation coefficients in this study follows commonly accepted guidelines, where values below 0.10 are regarded as negligible, 0.10–0.39 as weak, 0.40–0.69 as moderate, 0.70–0.89 as strong, and 0.90–1.00 as very strong. Two correlations exceed 0.70 in absolute value. As presented in
Figure 6, there is a strong positive correlation (r = 0.86) between the number of versions and the number of contributors, and a strong negative correlation (r = −0.76) between the creation date and the last edit date.
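The pairwise Pearson computation can be sketched on synthetic criterion columns whose correlations are built in by construction (the real values come from the extracted OSM criteria); `np.corrcoef` on an observations-by-criteria array yields the kind of matrix visualized in Figure 6.

```python
import numpy as np

# Synthetic criterion columns with correlations induced by construction
rng = np.random.default_rng(0)
versions = rng.poisson(5, 500).astype(float)
contributors = 0.6 * versions + rng.normal(0, 1, 500)        # positively related
creation_days = rng.uniform(0, 4000, 500)
last_edit_days = 4000 - 0.8 * creation_days + rng.normal(0, 300, 500)  # negatively related

X = np.column_stack([versions, contributors, creation_days, last_edit_days])
corr = np.corrcoef(X, rowvar=False)  # 4 x 4 Pearson correlation matrix
```

With `rowvar=False`, each column is treated as one variable, so `corr[i, j]` is the Pearson coefficient between criteria i and j.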
3.5. Clustering Quality Assessment
The quality of the K-means clustering solution was assessed using an approach that integrates both intrinsic and spatial perspectives. First, the silhouette coefficient was calculated to provide an internal validity measure of intra-cluster cohesion and inter-cluster separation, as described in
Section 2.7.1. The average silhouette score across all building polygons was 0.58, indicating reasonable cohesion within clusters and fairly good separation between them. This result supports the internal validity of the five-cluster partition derived from the reliability-related criteria.
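The silhouette coefficient can be computed directly from pairwise distances. The following is a minimal sketch (suitable only for small inputs, since it builds the full distance matrix), not the library routine used in the study.

```python
import numpy as np

def silhouette(X, labels):
    """Mean silhouette coefficient: for each point, (b - a) / max(a, b),
    where a is the mean intra-cluster distance and b the mean distance
    to the nearest other cluster. Assumes every cluster has >= 2 points."""
    D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))
    idx = np.arange(len(X))
    s = np.zeros(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        a = D[i, same & (idx != i)].mean()
        b = min(D[i, labels == c].mean() for c in np.unique(labels) if c != labels[i])
        s[i] = (b - a) / max(a, b)
    return s.mean()

# Two well-separated toy clusters: score approaches 1
X = np.vstack([np.zeros((5, 2)), np.full((5, 2), 10.0)])
labels = np.array([0] * 5 + [1] * 5)
score = silhouette(X, labels)
```

Scores near 1 indicate tight, well-separated clusters; scores near 0 or below indicate overlapping or misassigned points.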
To complement this intrinsic evaluation, Global Moran’s I was applied as a spatial statistical measure to determine whether clusters exhibited significant spatial autocorrelation or were randomly distributed across the study area, as described in
Section 2.7.2. When calculated using the categorical cluster labels from the K-means clustering, Moran’s I was 0.85 with a z-score of 107.9 and a
p-value < 0.001, indicating a very strong and highly significant clustered spatial pattern. This demonstrates that polygons belonging to the same reliability class are spatially concentrated rather than randomly distributed.
To further validate these findings, Moran’s I was recalculated using the continuous reliability score derived from the reliability criteria. In this case, Moran’s I was 0.59 with a z-score of 74.74 and a p-value < 0.001, again confirming a significant tendency toward spatial clustering. Although the magnitude of Moran’s I is lower for the continuous score than for the categorical clusters, both results consistently show that building reliability values are spatially structured rather than randomly dispersed.
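For illustration, the significance of Moran’s I can also be assessed with a permutation test. The study reports analytical z-scores, whereas this sketch estimates the z-score empirically by randomly relabeling values across locations, on a toy one-dimensional chain of units.

```python
import numpy as np

def morans_i(x, W):
    """Global Moran's I (W assumed row-standardized)."""
    z = x - x.mean()
    return (len(x) / W.sum()) * (z @ W @ z) / (z @ z)

def morans_z(x, W, n_perm=999, seed=0):
    """Permutation-based z-score: compare the observed Moran's I to its
    distribution under random shuffles of the values across locations."""
    rng = np.random.default_rng(seed)
    obs = morans_i(x, W)
    sims = np.array([morans_i(rng.permutation(x), W) for _ in range(n_perm)])
    return (obs - sims.mean()) / sims.std()

# Toy example: a chain of 30 units with strongly clustered values
n = 30
W = np.zeros((n, n))
for i in range(n - 1):
    W[i, i + 1] = W[i + 1, i] = 1.0
W = W / W.sum(axis=1, keepdims=True)   # row-standardize
x = np.array([0.0] * 15 + [1.0] * 15)  # clustered pattern
z_score = morans_z(x, W)               # large positive -> significant clustering
```

A z-score well above ~1.96 rejects spatial randomness at the 5% level, which is the same logic behind the z-scores of 107.9 and 74.74 reported above.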
The results of these assessments are summarized in
Table 7, which consolidates the numerical values for the silhouette coefficient and Moran’s I under different input fields. Further interpretation of these results is presented in the discussion section.
This analysis serves as a validation step, confirming that the five clusters identified by the K-means algorithm and labeled with qualitative categories (very low, low, moderate, high, very high) correspond closely with the underlying reliability scores. In practical terms, if a cluster is labeled as Moderate, the majority of polygons within that cluster exhibit reliability values concentrated in the moderate range. This consistency indicates that the categorical classification not only reflects algorithmic grouping but also accurately represents the spatial distribution of building reliability.
4. Discussion
The result of the scoring framework reveals a clear differentiation in the reliability of the five clusters. Cluster 5 achieved the highest mean reliability (60.2%), representing polygons with very high reliability, while cluster 2 showed the lowest average score (22.2%), corresponding to the very low reliability class. Clusters 1 and 4 were placed in the moderate and low categories, respectively, whereas cluster 3 was assigned to the high reliability group. Some samples of polygons from each class and their calculated reliability are shown in
Figure 7.
This classification framework not only provides a systematic means of quantifying polygon reliability but also offers an interpretable basis for understanding how intrinsic criteria derived from individual editing behavior influence the trustworthiness of building polygons across the study area.
The correlation analysis demonstrates a strong positive association between the number of versions and the number of contributors (r = 0.86), indicating that these two measures capture largely the same editing activity. Depending on data availability, it is therefore reasonable to retain only one of them or to combine them into a composite criterion. Future research could compute a Collaboration Rate, defined as the ratio of contributors to versions, to quantify how collaborative the editing of a feature is. Similarly, a strong negative correlation was found between the creation date and the last edit date (r = −0.76), reflecting the temporal dynamics of the data. Rather than excluding one of these criteria, future work could combine them into a composite measure, Data Activity Age, defined as the time interval between the creation date and the last edit.
The results further validate that the five K-means clusters, which were qualitatively labeled as very low, low, moderate, high, and very high, are consistent with the underlying reliability scores. For instance, polygons grouped in the cluster labeled as moderate predominantly exhibit reliability values concentrated in the moderate range. By combining both the silhouette analysis and Moran’s I, the clustering framework is validated from both intrinsic and spatial perspectives, ensuring robust and interpretable classification results.
5. Conclusions
This research proposes a machine learning-based approach to assess the reliability of polygonal building data in OSM by leveraging historical edit information and intrinsic feature attributes. Six key criteria were extracted for each polygon: the number of versions, creation date, last edit date, number of tag edits, cumulative area change, and number of contributors. These criteria were subsequently normalized using fuzzy sigmoid functions to ensure comparability across different value ranges. To evaluate the reliability class of VGI data and assign it to individual data records, a computational algorithm and framework were developed. This system provides a valuable supplementary mechanism for enhancing data quality and enables the implementation of quality control filters within a VGI-based geospatial information system.
In this study, we propose an analytical framework based on data mining and machine learning—specifically, unsupervised learning methods—to estimate the reliability of VGI and assign an appropriate reliability category. By extracting influential features from the data, including spatial, temporal, and other relevant attributes, the reliability of each data record was classified into one of five categories: Very High, High, Moderate, Low, and Very Low.
The spatial distribution of building polygon reliability classes, as illustrated in
Figure 5, reveals distinct clustering patterns across the study area. The Moderate class, which constitutes the largest proportion (52.1%), is predominantly concentrated in the northwestern and southwestern sectors, while appearing scattered across other parts of the city. Apart from these moderate-class concentrations, the remaining areas mainly consist of buildings with Low reliability (28.9%), notably concentrated in the northwestern sector, with no clear or consistent spatial pattern elsewhere in the urban landscape. In contrast, the Very Low reliability class (12.4%) is primarily concentrated in the central part of the study area, where it constitutes the dominant class, indicating a strong spatial focus rather than a dispersed distribution. The High reliability class (4.0%) is sparsely distributed, with clusters in certain planned urban districts. The Very High reliability class, representing the smallest proportion (2.6%), is scattered throughout the study area in small, contiguous groups of buildings. These clusters typically consist of several adjoining structures that collectively exhibit high geometric consistency and precise spatial data capture. Overall, the spatial configuration of building polygon reliability demonstrates clear differentiation across reliability classes.
Future research could extend the applicability of the proposed framework to other geographic regions and VGI platforms. Additionally, incorporating a broader range of quality indicators may further improve the reliability estimation of polygon features by capturing more diverse aspects of data quality.