A Regionalization Approach Based on the Comparison of Different Clustering Techniques

Aguilar Colmenero, José Luis; Portela Garcia-Miguel, Javier

doi:10.3390/app142210563

by José Luis Aguilar Colmenero ¹ and Javier Portela Garcia-Miguel ¹,²,*

¹ Group of Analysis, Security and Systems (GASS), Department of Software Engineering and Artificial Intelligence (DISIA), Faculty of Information Technology and Computer Science, Universidad Complutense de Madrid (UCM), 28040 Madrid, Spain

² Department of Statistics and Data Science, Faculty of Statistical Studies, Universidad Complutense de Madrid (UCM), 28040 Madrid, Spain

* Author to whom correspondence should be addressed.

Appl. Sci. 2024, 14(22), 10563; https://doi.org/10.3390/app142210563

Submission received: 15 October 2024 / Revised: 7 November 2024 / Accepted: 12 November 2024 / Published: 15 November 2024

(This article belongs to the Section Computing and Artificial Intelligence)

Abstract

For biodiversity conservation and the development of protected areas, it is essential to create strategic plans that ensure the preservation and sustainable use of natural resources. Biogeography plays a crucial role in supporting these efforts by identifying and categorizing geographic areas (regionalization) that represent different biotas, as well as recognizing patterns in biodiversity distribution. Another application of regionalization is in planning species sampling and inventories. Developing a species list is vital for monitoring and understanding diversity patterns. This study focuses on the Palearctic region, specifically the areas between Morocco, the Iberian Peninsula, and France. Its aim is to compare different clustering algorithms—such as K-means++, DBSCAN, PD-clustering, Infomap, and federated heuristic optimization based on fuzzy clustering—with a reference regionalization, using environmental and soil data. Various spatial contiguity approaches were applied, including the third-degree polynomial model and principal coordinates. The results demonstrated that the hybrid approach offers a robust solution in the construction of the regions and that K-means++ and PDC produced regions with strong spatial similarity to the reference regionalization, closely aligning with the expected number of regions, especially at the biome level. Our study shows that a purely statistical regionalization can approximate a global reference regionalization, making it reproducible.

Keywords: biodiversity; biogeography; K-means algorithm; DBSCAN algorithm; PD clustering algorithm; regionalization; Infomap; fuzzy c-means

1. Introduction

It is well known that biodiversity is being lost [1,2,3]. Its loss is mainly due to the development of infrastructure [4]. Therefore, strategic plans at the global level are essential to mitigate this situation. At the last Biodiversity Summit in 2022 in Canada, participants approved a series of measures with the main objectives of addressing the loss of biodiversity and restoring ecosystems. In this context, biogeography is a fundamental tool for the conservation of biodiversity [3,4,5,6]. It helps define and categorize geographical areas that represent biota [6,7] and recognize the main patterns in the territorial distribution of biodiversity [8]. The methods used in biogeography are important for identifying biodiversity hotspots, regions with high levels of taxa, and bioregions with endemism. Identifying these areas can help design protected areas that can more effectively conserve biodiversity [9]. Another aspect of the need to characterize or identify geographic areas, hereinafter referred to as regionalization, is the possibility of planning sampling efforts and designing species inventories. The objective of creating a species list aligns with the goals of assessing species richness or complementarity between areas for use in monitoring or describing patterns of diversity [10]. If collection locations or species lists are distributed across each of the regions [8], the probability of obtaining a reliable representation of all existing faunal assemblages would increase. Depending on the organisms studied, climatic, edaphic, and other conditions may have varying effects. For example, a regionalization based on general climatic conditions that is designed for plant species may not apply to other organisms [10], due to physiological differences between organisms and the distinct climatic effects on plants and animals.

Regionalization can occur at different scales, such as continental [11,12], global [5,10,13], or, as is often the case, on a local scale [8,14,15,16,17]. Furthermore, regionalization can have different objectives. In some cases, regionalization is qualitative [5], while in other cases it uses quantitative information [18,19,20,21]. Qualitative regionalization relies on the consensus and knowledge of various stakeholders, such as biologists, naturalists, and farmers. In contrast, quantitative regionalization is based on statistical analysis of biodiversity data. Quantitative models are generally more explicit, repeatable, transferable, and defensible compared to subjective models based on human experience [22].

Recent technological advances have improved the estimation of quantitative regions and/or bioregions. Specialized software has been developed for this purpose (e.g., Biodiverse 2010 [23]; Phyloregion 2020 [24]; synoptReg 2019 [25]). With current technology, it is becoming easier to employ complex clustering algorithms to search for patterns [26,27]. Some studies aim to compare grouping techniques [28,29] to assess the strengths and weaknesses of different algorithms and identify common patterns. Determining a common regionalization and validating it is a complex task. However, by comparing different regionalizations, one can verify the coincidence or overlap of regions, which can help establish the usefulness of regionalization and facilitate discussion based on its specific characteristics [30].

A regionalization method using climate and soil data is proposed to identify regions with distinct environmental patterns. To achieve this, a comparison of various clustering methods is presented aimed at generating environmentally homogeneous regions by applying different methodologies across several spatial contiguity scenarios.

In the literature, there are several studies examining the use of clustering techniques for regionalizing study areas. For example, Bloomfield et al. (2018) [12] compare various clustering methods to identify bioregions in Australia based on the distribution of Acacia and Eucalyptus species. Similarly, Patrick, Pata, et al. (2024) [31] introduce enhancements to the K-means algorithm to delineate bioregions based on zooplankton distribution. Additionally, Grassi et al. (2020) [32] compare hierarchical, spatial, and spectral clustering techniques. Another example is Pampuch, Luana et al. (2023) [33], who applied multiple techniques to achieve climatic regionalization across South America. Likewise, Zhao, Wenhao et al. (2022) [26] compare various clustering techniques to achieve a regionalization of China’s land. The present work discusses some of those methods in the context of environmental regionalization.

Clustering methods with diverse characteristics and conceptual approaches were used to examine which ones best fulfill the goals of a coherent regionalization, one that is, to some extent, aligned with a reference regionalization commonly used in the literature, derived from hybrid methods including expert consensus (2017©Resolve Ecoregions, Olson et al., 2001 [5]). The five clustering methods employed each represent a different approach, and it is believed that their comparison provides insight into which methods are most suitable for constructing regionalizations based on environmental data.

The clustering methods used are as follows: (a) a traditional method that is commonly accepted and widely used in the regionalization clustering literature, K-means [34]. A quick search on the Web of Science for the use of clustering in regionalization studies shows that, of the articles using K-means in the last two years, 60% are Q1, indicating a certain level of scientific acceptance of this method. See, for example, [26,32,33]; (b) a method suited for large datasets, useful in cases where data may be organized by pixels, as in our case, such as DBSCAN [35]; (c) a probabilistic method, the PD-clustering algorithm [36], which has been used on occasion for environmental regionalization and was considered promising for our purposes; (d) a hybrid method, federated heuristic optimization based on fuzzy clustering [37]; and (e) a community detection-based method, Infomap [38].

Two approaches were used to incorporate spatial contiguity; the first approach incorporated a third-degree polynomial (Lat + Long + Lat² + Lat × Long + Long² + Lat³ + Lat² × Long + Lat × Long² + Long³), commonly known as trend surface analysis (TSA) [39]. The second approach incorporated principal coordinates as spatial features (PCO) [40]. Spatial overlap with a reference regionalization was used to determine which regions are homogeneous regardless of the applied method and which are heterogeneous or borderline. Non-hierarchical clustering algorithms were selected for comparison with a reference regionalization, as this approach was considered the most appropriate [41]. This methodological comparison was applied to a sub-Palearctic area that includes the regions of Morocco, the Iberian Peninsula, and France.

2. Materials and Methods

2.1. Study Area

This study focuses on the geographical area encompassing France, the Iberian Peninsula, and Morocco, covering an approximate area of 1,672,351 km².

The environmental conditions of this study area are highly diverse. Morocco experiences both Mediterranean and desert climates, with high temperatures and low rainfall influenced by the Sahara. The Iberian Peninsula has a Mediterranean climate in the south and east, characterized by mild winters and dry summers, while the northern region experiences a more Atlantic climate, with moderate temperatures and frequent rainfall. In France, the climate ranges from oceanic in the west, with regular rainfall and mild temperatures, to more continental in the east, where winters are colder and summers warmer.

For this area, more than two million coordinates with environmental information are available.

2.2. Variables Used

The bioclimatic variables and potential evapotranspiration were obtained from WorldClim version 2.1 [42] at a spatial resolution of 30 arc-seconds, and the pH data were obtained from ISRIC World Soil Information [43] (see Table 1).

When necessary, pixels from all geographic layers were reprojected to a resolution of approximately 30 arc-seconds. To eliminate the scale differences among the original data values, all variables were normalized between 0 and 1 by the min–max method:

x_{\mathrm{scaled}} = \frac{x_{i} - x_{\min}}{x_{\max} - x_{\min}}
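As a small illustration of the min–max normalization, the following Python sketch rescales one variable to the range 0 to 1. The function name `min_max_scale` and the temperature values are hypothetical and are not taken from the study's data or code.

```python
def min_max_scale(values):
    """Rescale a sequence of numbers to [0, 1] via (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant variable has no spread; map it to 0.0 to avoid dividing by zero.
        return [0.0 for _ in values]
    return [(x - lo) / (hi - lo) for x in values]


# Hypothetical annual mean temperatures (degrees C) for five pixels.
temperature = [4.0, 10.0, 16.0, 22.0, 28.0]
scaled = min_max_scale(temperature)
print(scaled)  # [0.0, 0.25, 0.5, 0.75, 1.0]
```

Each variable would be scaled separately, so that variables measured in different units contribute on a comparable scale to the distance calculations.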

Typically, in similar studies [12], dimensionality reduction methods such as principal component analysis are used. However, this approach can obscure the direct interpretability of each variable’s contribution to the estimated regions. To address this, it was proposed that algorithms with all features be used without dimensionality reduction in order to assess the influence of each variable in the construction of the regions. For this purpose, the Gradient Boosting Machine (GBM) algorithm of Friedman [44] was applied, using as dependent variables the regions resulting from the model ultimately selected as appropriate and as explanatory variables the variables used in its construction.

2.3. Clustering Methods

We applied K-means and DBSCAN (Density-Based Spatial Clustering of Applications with Noise). Both methods belong to the family of non-hierarchical clustering [45,46], which aims at obtaining a direct representation of the relationships between objects instead of a summary of their hierarchy [47].

The K-means++ algorithm, a variant of K-means proposed by David Arthur and Sergei Vassilvitskii in 2007 [46], was applied. This algorithm selects the initial centroids more effectively. Instead of choosing all centroids uniformly at random, it chooses the first centroid at random and draws each subsequent one with probability proportional to the squared distance between a point and its nearest already-selected centroid, which favors well-separated starting centers. This process is repeated until the desired number of centroids is obtained. Then, the conventional K-means algorithm is applied to cluster the data (see Algorithm 1).

Algorithm 1: K-means++

1: Input: data points $x_{i} \in X$
2: K: number of clusters
3: Method:
4: Select a sample point $x_{i}$ uniformly at random from X
5: $C = \{x_{i}\}$
6: While $|C| < K$ do
7: Choose a new center $x \in X$ with probability proportional to $D(x)^{2}$, the squared distance from x to its nearest center in C
8: $C = C \cup \{x\}$
9: repeat
10: Assign all points to clusters $C_{i}$:
11: For each $i \in \{1, \dots, K\}$, set the cluster $C_{i}$ to be the set of points in X that are closer to $c_{i}$ than they are to $c_{j}$ for all $j \neq i$.
12: For each $i \in \{1, \dots, K\}$, set $c_{i}$ to be the center of mass of all points in $C_{i}$.
13: Repeat Steps 11 and 12 until the clusters no longer change.
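The steps of Algorithm 1 can be sketched in plain Python as follows. This is a minimal illustration, not the implementation used in the study: the helper names (`kmeanspp_init`, `lloyd`), the toy points, and the fixed random seed are all assumptions made for the example.

```python
import random


def squared_distance(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def kmeanspp_init(points, k, rng):
    """Pick k initial centers: the first uniformly, the rest with probability
    proportional to D(x)^2, the squared distance to the nearest chosen center."""
    centers = [rng.choice(points)]
    while len(centers) < k:
        d2 = [min(squared_distance(p, c) for c in centers) for p in points]
        # Assumes at least one point is distinct from the chosen centers,
        # otherwise all weights are zero and random.choices raises an error.
        centers.append(rng.choices(points, weights=d2, k=1)[0])
    return centers


def lloyd(points, centers, max_iter=100):
    """Conventional K-means: assign to the nearest center, then recompute means."""
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(len(centers)), key=lambda j: squared_distance(p, centers[j]))
            for p in points
        ]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for j in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, centers


# Hypothetical toy data: two well-separated groups of three points each.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
rng = random.Random(0)
centers = kmeanspp_init(points, 2, rng)
labels, centers = lloyd(points, centers)
print(labels)
```

Because already-chosen centers have zero distance to themselves, they receive zero weight and cannot be selected twice.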

The choice of the number of clusters is based on the evaluation of the intra-cluster sum of squares as a function of the number of clusters. The number of clusters beyond which the slope of this curve is no longer markedly steeper than at successive values is selected.

The number of groups to be formed was determined based on the following criteria:

a. Intra-cluster distance: measures the distance between the members of the same group and should be minimized. The number of clusters is chosen where the reduction in the intra-cluster sum of squares stabilizes.

b. Inter-cluster distance: measures the distance between the clusters and should be maximized. The number of clusters is chosen where the sum of squares between clusters is maximum.
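To make the two criteria concrete, the sketch below computes the intra-cluster (within) and inter-cluster (between) sums of squares for a given partition. The function name `within_between_ss` and the four toy points are hypothetical; in practice the labels would come from a clustering run for each candidate number of clusters.

```python
def centroid(rows):
    return [sum(col) / len(rows) for col in zip(*rows)]


def sq_dist(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def within_between_ss(points, labels):
    """Return (within-cluster SS, between-cluster SS) for a labeled partition."""
    overall = centroid(points)
    wss = 0.0
    bss = 0.0
    for lab in set(labels):
        members = [p for p, l in zip(points, labels) if l == lab]
        c = centroid(members)
        wss += sum(sq_dist(p, c) for p in members)      # spread inside the cluster
        bss += len(members) * sq_dist(c, overall)       # separation of the cluster
    return wss, bss


# Hypothetical toy data: two pairs of points far apart along the first axis.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
one_cluster = within_between_ss(points, [0, 0, 0, 0])
two_clusters = within_between_ss(points, [0, 0, 1, 1])
print(one_cluster, two_clusters)  # (104.0, 0.0) (4.0, 100.0)
```

The two quantities always add up to the total sum of squares, so a drop in the within-cluster value is mirrored by an equal rise in the between-cluster value; the number of clusters is read off where these curves flatten.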

Another algorithm used, DBSCAN [35], was proposed for large spatial datasets. It uses two parameters, “epsilon” (eps) and “minPoints”, to separate points that belong to clusters from points that do not belong to any cluster, known as noise (see Algorithm 2). The objective is to identify dense regions of points separated by sparse regions. The choice of eps and minPoints is crucial for the success of the algorithm: eps defines the neighborhood radius around a point, and minPoints is the minimum number of neighbors required within that radius. Using a neighbor search, DBSCAN grows each cluster from its “core” points, those with at least minPoints neighbors within eps. Points on the border have fewer neighbors but are still considered part of the cluster. A point that is neither a core point nor on the border of a cluster is classified as noise. For the neighbor search, both k-NN with Euclidean distance and k-dimensional trees (k-d trees) were used, with k-d trees being more efficient in terms of performance.

Figure 1 shows an example of selecting the optimal eps using data from our own dataset.

The optimal epsilon is the one that lies at the point of maximum curvature on the k-distance graph. If the eps value is too small, clusters with higher dispersion will be classified as noise, while if the eps is too large, denser clusters may merge [47]. For our study, we aimed to match the number of regions with those in the reference dataset, so different eps values from the k-NN distance curvature graph were tested.
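The k-distance idea can be sketched as follows. This is a rough illustration under stated assumptions: the knee is located with a simple heuristic (the point farthest below the straight line joining the ends of the sorted curve), used here as a proxy for maximum curvature, and the function names and toy points are hypothetical.

```python
import math


def k_distances(points, k):
    """Sorted distance from each point to its k-th nearest neighbor (brute force)."""
    result = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        result.append(dists[k - 1])
    return sorted(result)


def knee_index(curve):
    """Index where the curve lies farthest below the chord joining its ends."""
    first, last = curve[0], curve[-1]
    n = len(curve)
    gaps = [first + (last - first) * i / (n - 1) - y for i, y in enumerate(curve)]
    return max(range(n), key=lambda i: gaps[i])


# Hypothetical 1-D toy data: a dense run of points plus two isolated ones.
points = [(0.0,), (1.0,), (2.0,), (3.0,), (4.0,), (20.0,), (40.0,)]
curve = k_distances(points, k=2)
eps_candidate = curve[knee_index(curve)]
print(curve, eps_candidate)
```

On real data the brute-force neighbor search would be replaced by a k-d tree, and several eps values around the knee would be tried, as described above.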

Algorithm 2: DBSCAN

1: Input:
2: Radius = epsilon
3: minPoints
4: Method:
5: Mark all points $x_{i}$ of X as “not visited”
6: $\forall x_{i} \in X$:
7: if $x_{i}$ is visited then continue; otherwise mark $x_{i}$ as “visited”
8: get neighborhood Np of $x_{i}$: all points within a radius (epsilon)
9: if size of Np < minPoints then
10: $x_{i}$ is noise
11: continue
12: else
13: C = next cluster
14: $x_{i} \in C$
15: Seed set = Np \ $\{x_{i}\}$
16: $\forall y_{i} \in$ Seed set:
17: if $y_{i}$ is noise then $y_{i} \to C$
18: if $y_{i}$ is visited then continue; otherwise mark $y_{i}$ as “visited” and $y_{i} \to C$
19: get neighborhood Np of $y_{i}$: all points within a radius (epsilon)
20: if size of Np of $y_{i}$ < minPoints then
21: continue
22: Seed set $\leftarrow$ Seed set $\cup$ Np
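A compact Python sketch of Algorithm 2 is given below for illustration. It uses a brute-force neighbor search rather than k-d trees, counts a point as part of its own neighborhood, and runs on hypothetical toy data; the function names are assumptions for the example, not the study's code.

```python
import math

NOISE = -1


def region_query(points, i, eps):
    """Indices of all points within eps of point i (the point itself included)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]


def dbscan(points, eps, min_points):
    labels = [None] * len(points)          # None means "not visited"
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbours = region_query(points, i, eps)
        if len(neighbours) < min_points:
            labels[i] = NOISE              # may later be claimed as a border point
            continue
        cluster += 1                       # point i is a core point: start a cluster
        labels[i] = cluster
        seeds = [j for j in neighbours if j != i]
        k = 0
        while k < len(seeds):
            j = seeds[k]
            k += 1
            if labels[j] == NOISE:
                labels[j] = cluster        # noise reachable from a core point: border
            if labels[j] is not None:
                continue                   # already assigned
            labels[j] = cluster
            j_neighbours = region_query(points, j, eps)
            if len(j_neighbours) >= min_points:
                seeds.extend(j_neighbours)  # j is also a core point: keep expanding
    return labels


# Hypothetical 1-D toy data: two dense groups and one isolated point.
points = [(0.0,), (1.0,), (2.0,), (10.0,), (11.0,), (12.0,), (50.0,)]
labels = dbscan(points, eps=1.5, min_points=3)
print(labels)  # [0, 0, 0, 1, 1, 1, -1]
```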

Another algorithm applied was PD-clustering, or probabilistic distance clustering (PDC). This is an iterative, probabilistic clustering method that does not assume a specific distribution. It assigns each observation a probability of belonging to each cluster, based on its distance to each cluster center.

The main function used to measure the distance between an observation x and all the cluster centers is the joint distance function (JDF) of x, which is, up to a constant, the harmonic mean of the distances from x to the centers. For the entire dataset, the JDF is the sum of the JDF values for each observation.

The probability that observation x belongs to cluster k is calculated as follows:

p_{k} (x) = \frac{\prod_{j \neq k} d_{j} (x)}{\sum_{t = 1}^{K} \prod_{j \neq t} d_{j} (x)}, \quad k = 1, \dots, K

The distance d_{k} (x) between observation x and the cluster center c_{k}, using, for example, the Mahalanobis distance, is as follows:

d_{k} (x) = \sqrt{(x - c_{k})^{T} \Sigma_{k}^{-1} (x - c_{k})}

where \Sigma_{k} is the covariance matrix of cluster k.

When the probability is 1, the value of the JDF is 0, meaning the observation matches the cluster center. If the JDF value in the training set is not significantly different from the value in the test set, then we can say that the algorithm is adequate. As in other algorithms, users have to specify the number of clusters. This procedure is depicted in Algorithm 3.

Algorithm 3: PD-Clustering

1: Input:
2: $x_{i}$: point data of X
3: K: number of clusters
4: Method:
5: Select all points $x_{i}$ of X
6: Select K clusters
7: For each cluster k = 1, …, K:
8: probabilistic distance of $x_{i}$ on k: D($x_{i}$, k)
9: Calculate the joint distance function (JDF): harmonic mean of all D
10: Centers are max(D($x_{i}$, k)) for all $x_{i}$
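The membership probabilities and the JDF of a single observation can be illustrated with the short sketch below. It takes the distances to the K centers as given (how they are computed, for example with a Mahalanobis distance, is left out), and it assumes the usual PD-clustering form of the JDF, the product of the distances divided by the denominator of the probability formula, which equals the harmonic mean of the distances divided by K. The function names are hypothetical.

```python
import math


def pd_probabilities(distances):
    """p_k = prod_{j != k} d_j / sum_t prod_{j != t} d_j for one observation."""
    products = [
        math.prod(d for j, d in enumerate(distances) if j != k)
        for k in range(len(distances))
    ]
    total = sum(products)  # zero only if two or more distances are zero
    return [p / total for p in products]


def joint_distance(distances):
    """JDF of one observation (assumed form): prod_k d_k / sum_t prod_{j != t} d_j."""
    total = sum(
        math.prod(d for j, d in enumerate(distances) if j != t)
        for t in range(len(distances))
    )
    return math.prod(distances) / total


# Hypothetical observation at distance 1 from center 1 and distance 3 from center 2.
probs = pd_probabilities([1.0, 3.0])
jdf = joint_distance([1.0, 3.0])
print(probs, jdf)  # [0.75, 0.25] 0.75

# An observation lying exactly on a center: probability 1 and JDF 0.
print(pd_probabilities([0.0, 2.0]), joint_distance([0.0, 2.0]))
```

In this formulation the product of probability and distance is the same for every cluster and equals the JDF, so closer centers receive proportionally higher probabilities.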

The fourth method, federated heuristic optimization based on fuzzy clustering, was presented by Polap Dawid et al. [37] at the International Conference on Fuzzy Systems (FUZZ 2023). The approach leverages fuzzy clustering within federated learning frameworks to optimize complex, distributed systems (see Algorithm 4). This method is specifically designed to address the privacy and security constraints common in federated environments, where data remain decentralized. Fuzzy clustering (specifically, fuzzy c-means) is used to handle data uncertainty by clustering similar data points, even when characteristics overlap. This enables each local model to contribute to a global optimization objective without directly sharing raw data.

In this work, this methodology was adapted to our specific data and requirements using the R software libraries fclust [48] for applying fuzzy c-means and GA [49] for utilizing a genetic algorithm to optimize the number of regions. In the federation data, the centroids of K-means++ were taken as local nodes, so that each node or region retained its specific environmental data. By applying fuzzy c-means to each region, environmental data points that might belong to multiple regions were identified, thereby locating areas with shared environmental conditions, typically found at the borders between regions, such as climatic transition zones. For heuristic optimization, a genetic algorithm was used to optimize the number of clusters at a regional level, aiming to find the number of clusters that minimized intra-cluster error. Once each region had optimized its local model using fuzzy clustering and heuristic optimization, the results were centralized in a federated model by combining the membership matrices and deriving a global regionalization by selecting the maximum values from these matrices.

Algorithm 4: Federated heuristic based on fuzzy clustering

1: Input:
2: $x_{i}$: point data of X
3: Centers
4: Iterations: number of centers
5: Method:
6: Assign Centers <- centroids (previous other model)
7: {n_1, n_2, …, n_i} <- Randomly select observations
8: function fuzzy clustering:
9: fuzzy c-means
10: U <- membership matrix
11: C <- fuzzy centers
12: for i = 1 to iterations do
13: n_fcm_i <- function fuzzy clustering(n_i)
14: ga_i <- genetic algorithm (n_fcm_i)
15: end for
16: federated centroids <- mean(centers ∈ n_fcm_i)
17: for n_i do
18: m_fcm_i <- function fuzzy clustering(n_i, federated centroids)
19: ga_i <- genetic algorithm (m_fcm_i)
20: end for
21: federated centroids <- mean(centers ∈ m_fcm_i)
22: average membership <- mean(membership ∈ m_fcm_i)
23: C_fuzzy <- result(ga_i)

For the fuzzy component, we applied k-fold cross-validation with k = 5 to optimize parameters for the best results. The fuzziness parameter (m) was tested over a range from 1.1 to 2 in increments of 0.1. The maximum number of iterations was set to 30, using the Euclidean distance to evaluate the mean square error at each step. For the genetic algorithm applied to the fuzzy function, the maximum number of iterations was set to 100 with a population size of 1000, and the algorithm type chosen was “real-valued”.
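To show how the parameter m shapes the fuzzy memberships, and how a membership vector is turned into a final region by taking the maximum, the sketch below applies the standard fuzzy c-means membership formula to one observation. It is a Python illustration with hypothetical function names and distances; it does not reproduce the fclust or GA interfaces used in the study.

```python
def fcm_memberships(distances, m=2.0):
    """Standard fuzzy c-means membership of one point, given its distances
    to each center: u_k = 1 / sum_j (d_k / d_j)^(2 / (m - 1)), for m > 1."""
    if any(d == 0 for d in distances):
        # The point coincides with one or more centers: share full membership there.
        hits = [1.0 if d == 0 else 0.0 for d in distances]
        return [h / sum(hits) for h in hits]
    power = 2.0 / (m - 1.0)
    return [1.0 / sum((dk / dj) ** power for dj in distances) for dk in distances]


def harden(memberships):
    """Index of the cluster with the maximum membership."""
    return max(range(len(memberships)), key=lambda k: memberships[k])


# Hypothetical point at distance 1 from center 0 and distance 3 from center 1.
soft = fcm_memberships([1.0, 3.0], m=2.0)    # approximately [0.9, 0.1]
crisp = fcm_memberships([1.0, 3.0], m=1.1)   # m close to 1: nearly hard assignment
print(soft, crisp, harden(soft))
```

Values of m near 1 give almost crisp assignments, while larger values spread membership across neighboring clusters, which is what allows transition zones between regions to be detected.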

The final algorithm we applied was Infomap [38], a graph-based approach aimed at mapping the organization of interactions within an environmental network (see Algorithm 5). This method detects communities in complex networks using information theory to identify modules or clusters. Each network consists of various elements and the connections between them—commonly known as edges in graph theory—which can be weighted according to criteria such as similarity or distance among network elements. For our study, we used a distance matrix as the edges, employing Euclidean distance, which showed shorter processing time compared to alternatives like Manhattan or Canberra distances.

To implement this algorithm, the number of variables and observations had to be reduced, as the high volume of data led to complex connections and longer processing times. Following Bedia et al., 2013 [50], correlated variables were removed, leaving uncorrelated ones (bio02, bio03, bio08, bio13, bio14, bio15, and Evapo), and their values were scaled to a range of (0,1). The igraph package [51] in R, which includes Infomap, was used. Unlike algorithms such as K-means, Infomap estimates the optimal number of clusters or communities, doing so in two iterative phases where the map equation (objective function) reaches a stable minimum value.

(1): Assigning Nodes to Communities: In this phase, the algorithm initially assigns each node to a community and evaluates the impact on the map equation. A community node is modified, and the resulting reduction in the map equation is calculated. The goal is to minimize the map equation by shifting nodes between communities.
(2): Clustering Communities into Meta-Communities: In the second phase, the communities obtained in the previous step are treated as nodes at a higher level, forming a meta-community network. The algorithm is applied again to this clustered network to optimize the flow of information within the community structure. This process is repeated recursively until the map equation converges. In this phase, different assignments have been made to meta-communities depending on the percentage of nodes close to the source node (Table 2). In this way, we can check how the number of groups has varied and select the case that best suits our objective.

Briefly, the map equation serves as an objective function that measures how efficiently the information in a network is compressed by partitioning it into communities. This equation is based on the entropy of the paths that an information flow might follow within the network, representing the number of bits required to describe a random walk through the community structure. It is defined as follows:

L = q H (Q) + \sum_{i} p_{i} H (P_{i})

where q is the probability of leaving a community; H (Q) is the entropy of community nodes; p_{i} is the probability of staying within community i; and H (P_{i}) is the entropy of nodes within community i.
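As a worked illustration of the map equation, the sketch below evaluates L for a small undirected graph under two candidate partitions. It assumes the standard two-level form of the map equation, in which node visit rates are proportional to weighted degree, q is the total rate of leaving communities, and each community term includes its exit rate. The function names and the toy graph are hypothetical, and this is not the igraph implementation.

```python
import math


def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)


def map_equation(edges, membership):
    """Two-level map equation L (in bits) for an undirected, weighted graph.
    edges: list of (u, v, weight); membership: dict mapping node -> community."""
    total = 2.0 * sum(w for _, _, w in edges)
    visit = {}       # stationary visit rate of each node (weighted degree / total)
    exit_rate = {}   # rate at which a random walker leaves each community
    for u, v, w in edges:
        visit[u] = visit.get(u, 0.0) + w / total
        visit[v] = visit.get(v, 0.0) + w / total
        if membership[u] != membership[v]:
            for com in (membership[u], membership[v]):
                exit_rate[com] = exit_rate.get(com, 0.0) + w / total
    q = sum(exit_rate.values())
    length = q * entropy([qi / q for qi in exit_rate.values()]) if q > 0 else 0.0
    for com in {membership[n] for n in visit}:
        q_i = exit_rate.get(com, 0.0)
        inside = [visit[n] for n in visit if membership[n] == com]
        p_i = q_i + sum(inside)
        length += p_i * entropy([q_i / p_i] + [p / p_i for p in inside])
    return length


# Hypothetical graph: two triangles joined by a single edge.
edges = [(0, 1, 1), (1, 2, 1), (0, 2, 1), (3, 4, 1), (4, 5, 1), (3, 5, 1), (2, 3, 1)]
l_one = map_equation(edges, {n: 0 for n in range(6)})
l_two = map_equation(edges, {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1})
print(l_one, l_two)
```

Splitting the graph into its two triangles gives a shorter description length than keeping all nodes in one community, which is the kind of comparison Infomap makes when it moves nodes between communities.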

To select the closest connections between nodes, we have defined a threshold to create a network structure consisting of the nearest nodes, thereby enabling Infomap to identify communities more directly. Additionally, we have ensured that the diagonal of the distance matrix is zero to avoid self-connections. Below is a pseudocode representation of this algorithm adapted to our work.

Algorithm 5: Infomap

1: Input:
2: $x_{i}$: point data of X. Each $x_{i}$ is a node.
3: D(x): Euclidean distance
4: Method:
5: D(x)
6: Threshold <- quantile(D(x))
7: $D (x) \leq$ Threshold
8: diag($D (x) \leq$ Threshold) <- 0
9: phase 1:
10: $\forall x_{i} \in G$ do
11: if L is min() then
12: $x_{i} \in c_{i}$
13: phase 2:
14: Group $c_{i}$ in $C_{i}$
15: $D (C_{i})$
16: repeat phase 1
17: repeat until L no longer changes.

In summary, the algorithms differ in how they construct clusters and in how they determine the number of clusters. K-means++ relies on the distances between observations and cluster centers, whereas DBSCAN identifies the number of clusters from the density of observations: a minimum number of points (“minPoints”) must fall within a given radius, epsilon (“eps”), around a point. In the case of PD-clustering, the number of clusters is user-defined, and clusters are formed based on the probability that an observation belongs to each cluster, which depends on its distance to the cluster’s center. Infomap constructs clusters by grouping nodes together, thereby minimizing the amount of information needed to describe movements between communities. This iterative process optimizes the map equation through local reassignment of nodes and the grouping of communities at higher levels until the best flow retention is achieved.

2.4. Spatial Contiguity

To build the regions, climatic, soil pH, and various spatial variables were used to facilitate the formation of groups of localities that, in addition to sharing environmental characteristics, are spatially contiguous. In this manner, the importance and spatial coherence [52] of contiguity in the construction of regions were verified. For this, two different scenarios were analyzed. In the first scenario, methods without spatial contiguity were analyzed, while in the second scenario, spatial contiguity was incorporated into the construction of the regions.

The variables were standardized with a mean of 0 and a variance of 1. The geographical coordinates were incorporated in two different ways: (1) as a third-degree polynomial (Lat + Long + Lat² + Lat × Long + Long² + Lat³ + Lat² × Long + Lat × Long² + Long³) [39]; and (2) as principal coordinates (PCO) [53]. In the second case, the principal coordinates, calculated using the Mahalanobis distance matrix of the UTM coordinates of each cell, were incorporated as predictor variables. This matrix was weighted by a digital elevation model (DEM), with the maximum points weighted proportionally to the elevation value of the cell [54].
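The nine terms of the third-degree polynomial can be generated for each cell as in the sketch below. The function name `trend_surface_terms` and the example coordinates are hypothetical; in practice the resulting columns would be standardized together with the environmental variables before clustering.

```python
def trend_surface_terms(lat, lon):
    """Nine monomials of a third-degree polynomial in latitude and longitude."""
    return [
        lat, lon,                                          # first degree
        lat ** 2, lat * lon, lon ** 2,                     # second degree
        lat ** 3, lat ** 2 * lon, lat * lon ** 2, lon ** 3,  # third degree
    ]


# Hypothetical coordinates chosen only to make the arithmetic easy to follow.
terms = trend_surface_terms(2.0, 3.0)
print(terms)  # [2.0, 3.0, 4.0, 6.0, 9.0, 8.0, 12.0, 18.0, 27.0]
```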

2.5. Comparison of Regions and Selection of Results with Spatial Coherence

As mentioned above, validating a regionalization is complicated, beyond statistically demonstrating that the regions obtained differ from each other based on the variables used. However, its results can be compared to other regionalizations accepted by the scientific community. The main idea is to check whether the presented regionalization is similar to a reference regionalization; if so, its usefulness could be accepted. We compare the regionalization estimated in this work with the 2017©Resolve Ecoregions, which is a revision of Olson et al., 2001 [5].

According to Olson’s regionalization, our study area is classified into 22 ecoregions (see Figure 2a) and 5 biomes (see Figure 2b).

To compare the regions obtained from the different algorithms, the kappa index provided by QGIS 3.22.3 software was used [55]. The kappa coefficient [56] by Cohen, 1960 is as follows:

\kappa = \frac{P_{0} - P_{e}}{1 - P_{e}}

where P_{0} is the proportion of cases correctly classified (the observed agreement) and P_{e} is the proportion of cases expected to be correctly classified by chance. The value of \kappa can range between −1 and 1, with 1 indicating perfect agreement, 0 indicating agreement no better than chance, and negative values indicating agreement worse than expected by chance, which is difficult to interpret [57]. There are studies [58] that have questioned the application of the kappa index as a measure of accuracy. However, its use is appropriate for our purposes here, as it is designed as a measure of the degree of agreement between two independent evaluators. Finally, we selected the result from the algorithm with the highest kappa index compared to the other algorithms and the reference regionalization.
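For illustration, the kappa coefficient of two label maps can be computed as in the sketch below. It assumes that the cluster labels have already been matched to the reference labels, since cluster numbering is arbitrary, and it uses hypothetical labels for ten cells; it is not the QGIS implementation.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa: (P0 - Pe) / (1 - Pe) for two labelings of the same cells."""
    n = len(labels_a)
    p0 = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: product of the marginal proportions, summed over classes.
    pe = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    return (p0 - pe) / (1 - pe)  # undefined when pe == 1 (a single shared class)


# Hypothetical region labels for ten cells: reference map versus clustering result.
reference = [1, 1, 1, 2, 2, 2, 3, 3, 3, 3]
estimated = [1, 1, 2, 2, 2, 2, 3, 3, 3, 1]
kappa = cohen_kappa(reference, estimated)
print(round(kappa, 3))  # 0.701
```

Here eight of the ten cells agree (P0 = 0.8) and the chance agreement is Pe = 0.33, so kappa = 0.47 / 0.67.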

We have applied the silhouette score as a measure to evaluate the quality of results. This index examines how similar an object is to its own cluster compared to other clusters. The silhouette value for a single data point is calculated using the following formula:

s_{i} = \frac{b_{i} - a_{i}}{\max (a_{i}, b_{i})}

where s_{i} is the silhouette score for point i; a_{i} is the average distance between point i and all other points in the same region; and b_{i} is the average distance between point i and all points in the nearest region.
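The silhouette formula can be evaluated point by point as in the following sketch. It uses a brute-force Euclidean distance and hypothetical toy data, and it sets the score to 0 for singleton clusters, which is a common convention assumed here rather than something stated in the text.

```python
import math


def silhouette_values(points, labels):
    """Silhouette score s_i = (b_i - a_i) / max(a_i, b_i) for every point."""
    values = []
    for i, p in enumerate(points):
        by_cluster = {}
        for j, q in enumerate(points):
            if j != i:
                by_cluster.setdefault(labels[j], []).append(math.dist(p, q))
        own = by_cluster.pop(labels[i], None)
        if not own or not by_cluster:
            values.append(0.0)  # singleton cluster or only one cluster (assumed convention)
            continue
        a = sum(own) / len(own)                                 # mean distance within own region
        b = min(sum(d) / len(d) for d in by_cluster.values())   # nearest other region
        values.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return values


# Hypothetical 1-D toy data: two tight, well-separated regions.
points = [(0.0,), (1.0,), (10.0,), (11.0,)]
labels = [0, 0, 1, 1]
scores = silhouette_values(points, labels)
print(scores, sum(scores) / len(scores))
```

Values close to 1 indicate that a point sits well inside its own region, while values near 0 or below indicate points on or across a boundary.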

2.6. Importance of Variables in the Construction of the Regions

To understand the construction of the regions, we estimate the influence of the variables of the algorithm with the highest kappa index obtained. To do this, we applied the gradient boosting algorithm using the gbm package by Greenwell et al. [59] in R software version 4.4.1. This algorithm creates a set of individual decision trees sequentially, with each new tree attempting to correct the errors of the previous trees. To obtain a prediction for a new observation, the predictions of all the individual trees in the model are combined.

In this work, the usefulness of the approach was to determine the influence or relative importance of the variables in the construction of the regions. To achieve this, the solutions of the algorithms were used as the dependent variable (e.g., a discrete variable with groups 1 through 20). The data were split into 70% for training and 30% for testing. A total of 3568 iterations were performed with a grid of different hyperparameters, selecting the model with the lowest RMSE (root mean square error) as the final model.

RMSE = \sqrt{\frac{1}{n} \sum_{i = 1}^{n} (y_{i} - \hat{y}_{i})^{2}}

where n is the number of observations, y_{i} are the actual values, and \hat{y}_{i} are the predicted values.
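As a quick numerical check of the RMSE formula, the sketch below computes it for a hypothetical set of four actual and predicted values; the function name and the numbers are illustrative only.

```python
import math


def rmse(actual, predicted):
    """Root mean square error between actual and predicted values."""
    n = len(actual)
    return math.sqrt(sum((y - y_hat) ** 2 for y, y_hat in zip(actual, predicted)) / n)


# Hypothetical values: three exact predictions and one that is off by 4.
error = rmse([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 8.0])
print(error)  # 2.0
```

The single error of 4 contributes 16 to the sum of squares; dividing by n = 4 and taking the square root gives 2.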

The hyperparameters evaluated in the grid are detailed below:

Learning rate (shrinkage): Rates from 0.001 to 0.5 were considered, with a step of 0.1.
Maximum depth of the trees (interaction.depth): Depths from 1 to 20 were tested.
Minimum number of observations in the leaves (n.minobsinnode): Values of 10, 20, and 30 were evaluated.
Fraction of data used in each iteration (bag.fraction): Fractions from 0.1 to 1 were considered, with a step of 0.1.
Cross-validation (cv.folds): Five folds were used.

3. Results

3.1. Regionalization Without Spatial Contiguity

In the case of K-means++, both the intra-group and inter-group sum-of-squares curves showed a less pronounced slope from five regions onward (Figure 3).

Therefore, we selected this number of regions as appropriate. Furthermore, we also considered a solution with 22 regions to be able to compare with Olson’s ecoregions in our study area; Figure 4b,c show these results.

According to the results with PDC, for both 5 and 22 bioregions, the probability of assigning a cell or pixel to one region or another is high, resulting in a mix of regions (Figure 4d) that cannot be distinguished. To address the small differences between probabilities, we decided to regroup the regions. To do this, we performed a Student’s t-test to verify the equality of the average probability assigned to each cell and thus facilitate the regrouping of the regions. Regions 1 and 5 of the model with 5 regions (Table A1) have similar mean probabilities, with the test showing a non-significant result (p-value = 0.999). Other regions with similar mean probabilities according to the test are regions 2 and 4 (p-value = 0.874). We then regrouped into three regions: region 1 was combined with region 5, region 2 was combined with region 4, and region 3 did not need to be regrouped. For the model with 22 regions, according to the results of the Student’s t-test (Table A1), we grouped regions with a p-value equal to or greater than 0.9. Region 1 was combined with regions 2, 3, 5, 11, 14, 17, 19, 21, and 22. Region 4 was combined with regions 6, 7, 8, 9, 12, 13, 15, 16, 18, and 20. The remaining regions did not need to be regrouped. Consequently, we consolidated all the regions into three, resulting in a configuration similar to the one described above.

DBSCAN did not produce satisfactory results; almost all the observations belong to a single region (Figure 4a). One hundred iterations were carried out to obtain the result that was most coherent with the reference regionalization while being as heterogeneous as possible with minimal noise. We ultimately selected the model with the following parameters: eps = 79 and minPoints = 355. This resulted in 21 regions, with one region occupying 78% of all the cells and extending practically throughout the study area.
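To make the roles of eps and minPoints concrete, the following is a minimal pure-Python DBSCAN together with the share of cells taken by the largest cluster, the diagnostic quoted above. The toy points and parameter values are hypothetical and unrelated to the study data; this is a sketch of the standard algorithm, not the implementation used by the authors.

```python
import math
from collections import Counter

NOISE = -1


def region_query(points, i, eps):
    """Indices of all points within eps of point i (the point itself included)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]


def dbscan(points, eps, min_points):
    """Return one label per point: 1, 2, ... for clusters and -1 for noise."""
    labels = [None] * len(points)
    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_points:
            labels[i] = NOISE  # may later be relabeled as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_points:
                queue.extend(j_neighbors)  # j is a core point: keep expanding
    return labels


def largest_cluster_share(labels):
    """Fraction of all points that fall in the most populated cluster."""
    counts = Counter(label for label in labels if label != NOISE)
    return max(counts.values()) / len(labels) if counts else 0.0


# Toy data: a dense chain, a small separate group, and one isolated outlier.
chain = [(i * 0.5, 0.0) for i in range(20)]
small_group = [(100.0, 0.0), (100.3, 0.0), (100.6, 0.0)]
outlier = [(50.0, 50.0)]
labels = dbscan(chain + small_group + outlier, eps=1.0, min_points=3)
share = largest_cluster_share(labels)
print(labels, round(share, 3))
```

When one density-connected chain spans most of the data, as in this toy example, a single cluster absorbs most points regardless of how many small clusters are also found.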

The results obtained with Infomap depend on the percentage of the closest nodes selected for community formation. These closest nodes correspond to those with the shortest distances from the source node. Table 2 shows the different quantiles applied and the resulting number of regions. Since our goal is to achieve a number of regions comparable to those in the reference regionalization, we selected the result with five regions obtained by choosing 30% of the closest nodes. The regions identified (Figure 5a) present a map that is challenging to compare with Olson’s regionalization; regions 4 and 5 are nearly residual, with the majority of the study area situated in region 2, while regions 1 and 3 are distributed throughout region 2 as spatially independent communities. Notably, areas classified as region 1 are located within region 3. Given how few cells fall in regions 4 and 5, we decided to rerun the algorithm until we achieved a solution with three regions. To do this, we increased the proximity percentage to 50%, so that 50% of the closest nodes were selected based on their distance. The result (Figure 5b) is a map with a more logical spatial arrangement, where region 1 occupies almost the entire territory, leaving regions 2 and 3 in the southern part of Morocco.
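The effect of the proximity percentage can be illustrated by building the graph on which a community-detection algorithm would then run. The sketch below assumes that each node is linked to a fixed fraction of its nearest nodes; the exact rule used in the study is not specified in this section, Infomap itself is not implemented here, and the function names and toy points are hypothetical.

```python
import math


def nearest_fraction_graph(points, fraction):
    """Undirected edges linking each node to its closest `fraction` of other nodes."""
    n = len(points)
    keep = max(1, math.ceil(fraction * (n - 1)))
    edges = set()
    for i in range(n):
        others = sorted((j for j in range(n) if j != i),
                        key=lambda j: (math.dist(points[i], points[j]), j))
        for j in others[:keep]:
            edges.add((min(i, j), max(i, j)))
    return edges


def count_components(n_nodes, edges):
    """Number of connected components, found by depth-first traversal."""
    adjacency = {i: set() for i in range(n_nodes)}
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    seen, components = set(), 0
    for start in range(n_nodes):
        if start in seen:
            continue
        components += 1
        stack = [start]
        while stack:
            node = stack.pop()
            if node not in seen:
                seen.add(node)
                stack.extend(adjacency[node] - seen)
    return components


# Toy data: two tight groups of three points on a line.
toy_points = [(0.0,), (1.0,), (2.0,), (10.0,), (11.0,), (12.0,)]
edges_30 = nearest_fraction_graph(toy_points, 0.30)
edges_50 = nearest_fraction_graph(toy_points, 0.50)
print(len(edges_30), count_components(6, edges_30))
print(len(edges_50), count_components(6, edges_50))
```

Raising the fraction adds longer edges that bridge previously separate groups, which is consistent with larger percentages yielding fewer, larger communities.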

3.2. Regionalization with Spatial Contiguity

The regions produced by K-means++ for each of the two contiguity approaches (PCO, TSA) exhibit a common pattern in the construction of the regions when considering 5 regions (Figure 6a,b). Region 1 for both methods is primarily located in France; with the PCO model, this region extends up to the Cantabrian mountain range. Region 2 is mainly concentrated in the center of the Iberian Peninsula and the Atlas region of Morocco. Region 3 is concentrated in areas close to the Atlantic Ocean. Region 4 extends from the Anti-Atlas to the Sahara-Atlas, while region 5 is mainly located in Morocco and extends into the westernmost part of the Iberian Peninsula, with an incursion into the southeast of the Iberian Peninsula.

In the case of the 22 regions (Figure 6c,d), the two methods applied do not produce a similar pattern. When applying PCO, we verified that this method results in fewer divisions within the same area compared to TSA. France shows fewer regions or divisions with PCO than with TSA, as does the Iberian Peninsula and Morocco. In the latter area, we found some regions with a common pattern between the two methods in the southeast of the country.
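As a reminder of what the TSA contiguity term contributes to the clustering input, the sketch below builds the nine monomials of a third-degree polynomial in the cell coordinates and appends them to each cell's environmental variables. Centering the coordinates and the names `trend_surface_terms` and `add_tsa_terms` are assumptions made for illustration, not details taken from the study.

```python
def trend_surface_terms(x, y, degree=3):
    """Monomials x^a * y^b with 1 <= a + b <= degree, ordered by total degree."""
    return [x ** (d - b) * y ** b
            for d in range(1, degree + 1)
            for b in range(d + 1)]


def add_tsa_terms(features, coords, degree=3):
    """Append trend-surface terms of the centered coordinates to each feature row."""
    mean_x = sum(x for x, _ in coords) / len(coords)
    mean_y = sum(y for _, y in coords) / len(coords)
    return [list(row) + trend_surface_terms(x - mean_x, y - mean_y, degree)
            for row, (x, y) in zip(features, coords)]


# Degree 3 yields x, y, x^2, xy, y^2, x^3, x^2 y, x y^2, y^3.
terms = trend_surface_terms(2.0, 3.0)
print(terms)

# Two toy cells with two environmental variables each (hypothetical values).
augmented = add_tsa_terms([[12.5, 640.0], [17.1, 210.0]],
                          [(-3.7, 40.4), (-7.6, 33.6)])
print(len(augmented[0]))
```

Because these terms vary smoothly with location, cells that are close in space receive similar values, which nudges a distance-based algorithm toward spatially contiguous regions.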

For the Infomap algorithm, we obtained a number of clusters that can be compared a priori with the reference regionalization when we selected 25% of the closest nodes (Table 2). The results vary depending on the contiguity approach used. In the case of TSA contiguity (Figure 7a), the spatial ordering we obtained is very similar to that seen with K-means++ or PDC, revealing a certain level of order. However, when we include contiguity as principal coordinates (PCO) for k = 5, the result does not display any discernible pattern or coherent community (Figure 7b1). To improve this outcome, we selected 35% of the closest nodes, resulting in three regions (Figure 7b). With this number of regions, we can observe how the algorithm has grouped communities with similar environmental characteristics; we particularly highlight region 3, which is located in the southeastern part of France.

The results obtained with PDC differ depending on the contiguity technique used. When we apply TSA, for both the 5- and 22-region cases, we observe areas in which no specific region can be distinguished, similar to what was seen with the algorithm that did not include spatial contiguity. We therefore decided to regroup these cells, performing a t-test to verify the equality of the average probability assigned to each cell (Table A1). For the model with 5 regions and contiguity as a third-degree polynomial (TSA), regions 1 and 4 present similar mean values, with a non-significant contrast (p-value = 0.999). Similarly, regions 2 and 5 have similar mean values, with a non-significant contrast (p-value = 0.97). We decided to regroup them into three regions. In the case of the 22 regions, the result is a regionalization with three groups as before. We regrouped them into four regions based on a similar average probability according to the t-test. Regions 1, 3, 4, 10, 11, 14, 18, 21, and 22 were grouped into one region; regions 2, 5, 6, 7, 8, 12, 13, 15, 16, 17, and 20 were grouped into another region; and regions 9 and 19 were merged into another region.

Figure 8 shows the regionalization after grouping the clusters. This result is identical to the regionalization obtained from the grouping without considering spatial contiguity. It could be said that if the number of regions into which the territory is divided is small relative to the total area, spatial contiguity is not a significant factor in constructing the regions.

In the case of including principal coordinates (PCO), regrouping of the regions is not necessary; it seems that this contiguity technique yields more spatially coherent results. It shows a certain circular pattern for the cases of the 5 and 22 bioregions. For both cases, region 2 is located in the center of the Iberian Peninsula and the southeast of Morocco, with region 1 surrounding region 2. The remaining regions surround the preceding ones, and so on (Figure 9).

To obtain a number of regions similar to the reference regionalization, we performed 100 iterations when applying DBSCAN. In the case of incorporating contiguity as a third-degree polynomial (TSA), we obtained a number of clusters approximately equal to that of the reference regionalization with the parameters eps = 360 and minPoints = 80. This resulted in 19 regions, with Region 1 occupying 89% of all cells, while the remaining regions occupied very few cells and were almost insignificant. When we included the principal coordinates (PCO), the result was very similar, producing a single region covering almost the entire study area, this time 75%. This result is less extreme compared to the TSA case, demonstrating how the incorporation of spatial eigenvectors slightly improves the regionalization result. Both results are comparable to the one obtained previously without considering spatial contiguity (Figure 4a).

3.3. Comparison of the Regions

Table 3 shows the kappa index for the comparison of regions between the different algorithms and the reference regionalization. The highest kappa index is observed in the comparison between the regions obtained with PDC without including contiguity (three regions, k = 3) and K-means++ including spatial contiguity as a third-degree polynomial (five regions, k = 5), with a kappa value of 0.4159. This is followed by a kappa index of 0.356 in the comparison between K-means++ (k = 5, spatial contiguity as a third-degree polynomial) and the federated model. Another comparison with a high index (0.3205) is that between PDC (k = 3) and K-means++ (k = 5), both considering spatial contiguity as a third-degree polynomial. Regarding the comparison with the reference regionalization, the highest kappa index is shown by the comparison between the biomes and the three regions obtained with PDC, including spatial contiguity as a third-degree polynomial and k = 3, with a value of 0.2427. Another algorithm that presents a positive kappa index with the reference regionalization (0.2076) is K-means++ for the case of the 5 regions, including spatial contiguity as a third-degree polynomial. Infomap, using TSA contiguity, also exhibits a positive index with respect to the Olson regions for k = 5. This algorithm is spatially comparable with other algorithms, as it demonstrates a high kappa index when compared to the regions obtained through K-means++ and PDC, both of which also use TSA contiguity for k = 5. Infomap TSA, K-means++ TSA, and PDC TSA appear to be good candidates for a coherent regionalization since they have a high degree of comparability among themselves and with Olson’s biomes.
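For readers unfamiliar with the kappa index, a minimal sketch of Cohen's kappa for two cell-by-cell labelings follows. It assumes that both maps already use a common label coding; how labels from different algorithms were matched in the study is not stated in this section, and the toy labels and the function name `cohen_kappa` are hypothetical.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa: agreement between two labelings, corrected for chance."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: product of the marginal proportions, summed over labels.
    expected = sum(count_a[c] * count_b[c]
                   for c in set(count_a) | set(count_b)) / n ** 2
    if expected == 1.0:
        return 1.0  # both maps assign every cell to the same single label
    return (observed - expected) / (1.0 - expected)


# Toy maps of six cells each (hypothetical labels).
map_a = [1, 1, 2, 2, 3, 3]
map_b = [1, 1, 2, 3, 3, 3]
kappa = cohen_kappa(map_a, map_b)
print(round(kappa, 4))
```

A value of 1 indicates perfect agreement, 0 indicates agreement no better than chance, and negative values indicate agreement below chance.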

The kappa index to compare 22 regions obtained with the different algorithms and Olson’s ecoregions does not show results that indicate a significant degree of positive overlap. This suggests that, when the number of regions to be estimated is high, the algorithms complicate the comparison between purely quantitative regionalizations and non-quantitative regionalizations such as Olson’s. Although the results obtained with PDC for k = 3 present the highest kappa index (0.2427), it is not selected since analyses with so few groups would not be informative for such a large area [48], because k = 3 divided the territory into north (France), center (Iberian Peninsula), and south (Morocco).

Figure 10 displays, in boxplot form, the silhouette scores of the regions produced by the algorithms with a higher kappa index compared to the reference regionalization (DBSCAN PCO with k = 19; federated model with k = 7; Infomap TSA with k = 5; K-means++ TSA with k = 5; and PDC TSA with k = 3). In the federated model (as discussed below), all regions, except for region 3, achieved positive scores above the average value of 0.16. This suggests that the points within these regions are well assigned. For the Infomap method, regions 2 and 3 have relatively high values compared to the average of 0.14, while the other regions, particularly region 1, exhibit negative values. This indicates that some points are poorly assigned or that the regions overlap. In the case of K-means++, all regions except region 3, which shows negative values, generally exhibit positive silhouette scores, with an average of 0.16. The three regions estimated using PDC also show predominantly positive silhouette scores, with an average value of 0.32. Notably, regions 1 and 3 significantly exceed this average, indicating that the allocation of points to these regions is appropriate. For DBSCAN, certain regions (6, 8, 9, 12, 15, and 19) show high silhouette scores, close to 1. This suggests that the observations within these regions are well classified. However, as previously shown (Figure 4a), the majority of observations are grouped into a single region, which makes it difficult to compare this result with any reference regionalization. The results obtained with Infomap TSA and the federated model provide robust and logical solutions, and K-means++ for k = 5 with spatial contiguity as a third-degree polynomial (from now on, K-means++, k = 5, TSA) aligns particularly well with the reference regionalization. This alignment improves the reliability of the obtained regions and highlights the robustness of this method for effective clustering; for this reason, we consider it the optimal solution.
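The silhouette scores discussed above follow the standard definition s = (b - a) / max(a, b), where a is a point's mean distance to the other members of its own region and b is the smallest mean distance to the members of any other region. A minimal pure-Python sketch with toy one-dimensional points is given below; the data and the function name are hypothetical.

```python
import math


def silhouette_scores(points, labels):
    """Per-point silhouette: (b - a) / max(a, b), or 0.0 when undefined."""
    scores = []
    for i, (p, own_label) in enumerate(zip(points, labels)):
        dist_by_cluster = {}
        for j, (q, other_label) in enumerate(zip(points, labels)):
            if i != j:
                dist_by_cluster.setdefault(other_label, []).append(math.dist(p, q))
        own = dist_by_cluster.pop(own_label, None)
        if not own or not dist_by_cluster:
            scores.append(0.0)  # singleton cluster, or no other cluster exists
            continue
        a = sum(own) / len(own)  # mean distance within the point's own cluster
        b = min(sum(d) / len(d) for d in dist_by_cluster.values())  # nearest other
        denominator = max(a, b)
        scores.append((b - a) / denominator if denominator > 0 else 0.0)
    return scores


# Two compact groups: every point is well assigned.
clean_points = [(0.0,), (1.0,), (10.0,), (11.0,)]
clean_scores = silhouette_scores(clean_points, [1, 1, 2, 2])

# Same data plus a point at 2.0 that is wrongly labeled as belonging to group 2.
noisy_scores = silhouette_scores(clean_points + [(2.0,)], [1, 1, 2, 2, 2])
print([round(s, 3) for s in clean_scores])
print(round(noisy_scores[-1], 3))
```

Scores near 1 indicate points that sit well inside their region, while negative scores, as for the mislabeled point here, indicate points that lie closer to another region than to their own.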

3.4. Federated Heuristic Optimization Based on Fuzzy Clustering

Using the centroids of the regions obtained from K-means++ k = 5, TSA, we designed a federated model to optimize the number of regions using genetic algorithms and fuzzy clustering. By applying cross-validation with five blocks, the average number of regions across all iterations (we performed 100 iterations) was seven (see Figure 11).
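The fuzzy clustering component can be illustrated with one update step of fuzzy c-means started from given centroids, in the same spirit as seeding the federated model with the K-means++ centroids. This sketch covers only the standard membership and centroid updates; the genetic-algorithm search over the number of regions and the federated aggregation are not shown, and the fuzzifier m = 2, the toy data, and the function names are assumptions.

```python
import math


def fuzzy_memberships(points, centroids, m=2.0):
    """Fuzzy c-means memberships: u_ij proportional to d_ij^(-2 / (m - 1))."""
    power = 2.0 / (m - 1.0)
    memberships = []
    for p in points:
        dists = [math.dist(p, c) for c in centroids]
        if 0.0 in dists:
            # A point lying exactly on a centroid belongs fully to it.
            zeros = dists.count(0.0)
            memberships.append([1.0 / zeros if d == 0.0 else 0.0 for d in dists])
            continue
        inverse = [d ** -power for d in dists]
        total = sum(inverse)
        memberships.append([value / total for value in inverse])
    return memberships


def update_centroids(points, memberships, m=2.0):
    """Centroids as means of the points weighted by membership^m."""
    n_clusters, dim = len(memberships[0]), len(points[0])
    centroids = []
    for j in range(n_clusters):
        weights = [u[j] ** m for u in memberships]
        total = sum(weights)
        centroids.append(tuple(
            sum(w * p[d] for w, p in zip(weights, points)) / total
            for d in range(dim)))
    return centroids


# Toy cells and two starting centroids (hypothetical values).
cells = [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0), (9.0, 0.0), (10.0, 0.0)]
start_centroids = [(0.5, 0.0), (9.5, 0.0)]
u = fuzzy_memberships(cells, start_centroids)
new_centroids = update_centroids(cells, u)
print([[round(v, 3) for v in row] for row in u])
print(new_centroids)
```

Unlike hard clustering, each cell keeps a degree of membership in every region; the cell halfway between the two centroids receives a membership of 0.5 in each.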

The regions obtained through this hybridization of techniques have resulted in heterogeneous areas, particularly for Morocco and the Iberian Peninsula, with only three environmentally defined regions identified in the French territory. We compared this regionalization with Olson’s biomes using the kappa index, resulting in a value of 0.069, and with the ecoregions, we obtained a value of 0.003. Both values indicate a poor spatial coincidence; however, the ordering produced by this hybrid model is the most favorable based on the silhouette index scores presented in Figure 10. Therefore, this regionalization provides a robust solution despite its lack of resemblance to Olson’s regions.

3.5. Most Important Variables in the Construction of the Regions

The regions resulting from the previously selected model, K-means++ for 5 regions with spatial contiguity as a third-degree polynomial, have been used as dependent variables in a GBM algorithm. After iterations, the best fit obtained an RMSE of 0.3661 in the training phase. The three variables with the highest significance (Table 4) were as follows: Annual Mean Temperature (Bio1), Mean Temperature of Coldest Quarter (Bio11), and Precipitation Seasonality (Coefficient of Variation) (Bio15).

4. Discussion

Regionalization can be considered simply as a classification scheme through which internal homogeneity and transboundary heterogeneity are maximized [60], thereby showing the ecological relationships between coexisting organisms and their environment. This paper presents the outcomes of three purely analytical classification techniques using edaphic environmental data under two spatial contiguity approaches, obtaining both homogeneous and heterogeneous results between the models. As previously mentioned, the application of these techniques has several advantages, including interpretability, flexibility, and reproducibility [61].

4.1. Limitations of the Algorithms Used

Throughout this work, we observed that the regions obtained with K-means++ exhibited spatial coherence regardless of whether spatial contiguity was included. Applying PDC, we noted limitations in assigning probabilities to each cell. The probability matrix assigns, in most cases, different regions to neighboring cells, especially in the northern part of the Iberian Peninsula and the entire French region. It was necessary to reduce the number of regions by merging those with similar probabilities to improve the practical application of the results [28]. Using Infomap to build communities from a distance matrix with edges, we often encounter a situation where nodes with a high degree of connectivity (hubs) dominate the information flow. This concentration can lead to bias in clustering when we set the number of regions to five. However, by reducing the number of regions to three, the algorithm is compelled to form larger communities that consider both the highly connected nodes and the more isolated ones. The DBSCAN algorithm has serious limitations when dealing with multidimensional data, such as climatic and/or edaphic data. Despite the unsatisfactory results, we found that incorporating spatial contiguity as principal coordinates provides some context to the results. We achieved coherent region ordination when including principal coordinates as spatial contiguity; this demonstrates how incorporating spatial autocorrelation through orthogonal vectors could improve the results.

4.2. Analytical Regionalization Versus Reference Regionalization

The algorithm-based regionalization approach offers several scientific advantages, including reproducibility, adaptability to changes in data, and the capacity to address ecological questions in a more direct and comparable manner. In our case, when comparing our results with the reference regionalization, we found that the statistical or quantitative construction of regions by certain algorithms, such as Infomap, K-means++, or PDC, aligns with the biome level according to the kappa index obtained, indicating a certain degree of similarity. However, these algorithms do not accurately match regions at a finer level, such as ecoregions.

4.3. Characterization of the Regions

The main variables for the construction of the regions obtained with the accepted model (K-means++, k = 5, TSA) were Annual Mean Temperature (Bio1), Mean Temperature of the Coldest Quarter (Bio11), and Precipitation Seasonality (Coefficient of Variation) (Bio15). These parameters are those that have shown the greatest relative importance (see Table 4) after various adjustments using the gradient boosting algorithm. However, each bioregion has particular characteristics. In the following, we highlight the main characteristics of each region. The climate and precipitation values are obtained from the output of the K-means++, k = 5, TSA model.

Region 1. This region extends mainly through the Cantabrian mountain range without reaching the Cantabrian coast. It is known as the Eurosiberian Orocantábrica region according to the classification of the Spanish National Institute of Geography (hereinafter ING) [62]. It makes an incursion through the Iberian system up to the Moncayo mountain. Throughout this area, oak and beech forests, cryotemperate grasslands, oro-cryotemperate vegetation on rocky fields and scree, supra-arid and high mountain pastures, Cantabrian high mountain scrubland, peatland vegetation, meso-supratemperate scrubland, meadows, and meso-supratemperate grasslands predominate [63]. It continues through the Pyrenees, where mixed and coniferous forests predominate [64], and extends to areas of the coastal mountain range of the Iberian Peninsula. It also occupies almost the entire French region, including the island of Corsica. The only areas not linked to this bioregion are Landes, western Brittany, and the area bordering the Mediterranean Sea from southwestern Provence to Girona. In the French area, the region includes different habitats where the vegetation is composed of conifers, mixed oak forests, and primeval beech forests in the central-eastern area [65]. The average annual precipitation in this region ranges from 700 mm to 1000 mm, and the average annual temperature is between 9 °C and 12 °C in mountainous areas, with minimum average values sometimes falling below 10 °C.

Region 2. This region is divided into three main areas: one in the center of the Iberian Peninsula and a large part of the Balearic Islands; another in the center of Morocco; and the area bordering the Mediterranean Sea, from southwestern Provence to Girona. In the Iberian Peninsula, it extends from south to north, from Sierra Morena to the Northern Plateau, and from east to west, from the border of Spain with Portugal to the Valencian coast. It is characterized by different types of vegetation. In the most central part of the region, we find low scrub plains and silvopastoral forests with a strong presence of agricultural and livestock activity, according to the description of Iberian sclerophyllous and semi-deciduous forests by WWF [66]. Further south, we find pine forests and areas with significant plant diversity, such as Sierra Nevada, which has more than 2100 vascular plants, according to the description of Iberian conifer forests by WWF [67]. On the Moroccan side, the region stretches across the Atlas Mountains, from Jebel to include the northern part of the Sahara Atlas. It is described by WWF [68] as being dominated by high, arid plateaus with steppe grasslands, scrublands, dunes, seasonal salt lakes, and mountainous slopes with pine forests. This region in the Catalan–French part is characterized by sand dunes, salt lakes, and cliffs. Climatically, the ecoregion experiences very hot and dry summers and relatively mild winters on average. Average annual temperatures range from 12 °C to 15 °C, and average annual rainfall ranges from 300 mm to 525 mm.

Region 3. Extending from the coast of Lisbon to La Rochelle and in the northwest of France, it is concentrated in Brittany, with similar edaphic-environmental areas also found in the southernmost part of the Iberian Peninsula, as well as in the north of Morocco, mainly in the Rif. This entire region is bordered by the Atlantic Ocean and connected to the south with the Mediterranean Sea through the Strait of Gibraltar. It presents diverse habitats, predominantly mixed oak forests and heathland along the coast. In Brittany, Atlantic birch forests can be found in nutrient-poor soils. In southern France and in the Cantabrian–Atlantic region of the Iberian Peninsula, there is coastal dwarf scrub on saline cliffs, as well as rock scrub along the Portuguese coasts [69]. In the Andalusian-Moroccan area, we can find thermo-meso-Mediterranean sandy scrublands, as well as Portuguese acidic palaeodunes [69]. The mean annual temperature ranges from 11 °C to 15 °C, with mean annual rainfall ranging from 900 mm to 1300 mm.

Region 4. This region is found exclusively in Morocco; there are no similar conditions outside Morocco within the study area. It extends mainly in the south of Morocco, from the southern limit of the Atlas Mountains to the border with Algeria. In the eastern part, it includes areas belonging to the province of Figuig, and in the westernmost part, it reaches the border with Western Sahara. It essentially features a desert landscape and xeric scrubland [70], including the Erg dunes as well as those of El Hneite, where arid scrubland and desert oases predominate. The region also includes a large area extending from the Agafay Desert to the province of El Kelaa des Sraghna, which encompasses arid soils with a strong presence of agricultural activity, mainly citrus fruits [71] and olive cultivation. The average annual temperature ranges from 12 °C to 32 °C, although it can be higher. The average annual rainfall ranges from 80 mm to 184 mm.

Region 5. It is found in the westernmost part of Morocco, the Canary Islands, and the Iberian Peninsula. In the Iberian Peninsula, it covers the southwest and extends to the east along the entire Mediterranean coastline from the Penibetic Mountain Range to the mouth of the Turia River in the Mediterranean Sea, as well as the westernmost part of the Balearic Islands. It includes important freshwater areas such as lagoons or wetlands, including those of great importance for migratory birds, such as Doñana National Park or the Ebro Delta [72,73,74]. The landscape in the southwest of the Iberian Peninsula is characterized by forests composed mainly of evergreen species such as cork oak [68]. In the east, it is characterized by scrub and pine forest vegetation, as well as extensive areas of cultivation. In Morocco, occupying a coastal area, this region provides refuge for many birds, with one of the most important areas being Souss Massa National Park for the conservation of, for example, the Bald Ibis [75]. Argan forests predominate between the Western High Atlas and the Anti-Atlas, while wild olive and carob trees are found in the northern part of the region. The average annual temperature for the region in the Moroccan zone ranges between 10 °C and 21 °C. The average annual rainfall ranges between 31 mm and 420 mm. The average annual temperature for the region within the Iberian Peninsula ranges between 14 °C and 19 °C. The average annual rainfall ranges between 200 mm and 845 mm.

5. Conclusions

To our knowledge, this is the first quantitative attempt to conduct an environmental regionalization of the study area. We utilized a range of algorithms, each presenting its own advantages and limitations. The process of selecting the most appropriate method that closely aligns with the reference regionalization proved to be a valuable learning experience. We identified several positive results when comparing our results with the reference regionalization; the spatial and environmental coherence of the regions generated by the K-means++ algorithm, which incorporates spatial contiguity through a third-degree polynomial, underscores the analytical robustness of our approach. This method enabled us to achieve a certain level of comparability, particularly at broader classification scales, such as Olson biomes, though not at finer scales, such as ecoregions.

We conclude that the regionalization produced by K-means++ is reliable for high-level environmental classifications. The regions obtained are aligned to some degree with one of the most widely used regionalizations in the scientific community, and the method has proven robust in the construction of regions. Consequently, this regionalization can serve as a valuable tool for environmental research, for example, in the identification of climatic patterns that influence the presence/absence of species or in predictive distribution models. The results obtained from the federated model and Infomap have also demonstrated robustness in the construction of regions, highlighting above all how the hybridization of models can be a powerful tool for clustering. In contrast, other models such as DBSCAN have not achieved an adequate spatial ordering; it would therefore be worth evaluating extensions of this algorithm, such as HDBSCAN.

Author Contributions

Conceptualization, J.L.A.C. and J.P.G.-M.; methodology, J.L.A.C.; validation, J.L.A.C. and J.P.G.-M.; investigation, J.L.A.C.; formal analysis, J.L.A.C.; data curation, J.L.A.C.; writing—original draft preparation, J.L.A.C.; writing—review and editing, J.L.A.C. and J.P.G.-M.; visualization, J.L.A.C.; project administration, J.L.A.C.; funding acquisition, J.L.A.C. and J.P.G.-M. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the authors on request.

Acknowledgments

Thanks to Jorge Miguel Lobo, National Museum of Natural Sciences of Spain, for his knowledge and contributions.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A

Table A1. Estimated p-values from Student’s t-test comparing the average probability of a cell belonging to one region or another. SC: algorithm without spatial contiguity; TSA: algorithm including spatial contiguity as a third-degree polynomial; PCO: algorithm including eigenvectors of the distance matrix as spatial contiguity. Significance level: 0.05.

Model	Regions/ Contiguity	2	3	4	5	6	7	8	9	10	11	12	13	14	15	16	17	18	19	20	21	22
PDC with 22 regions	1 SC	0.999	0.999	2.2 × 10⁻¹	0.999	2.2 × 10⁻¹	2.2 × 10⁻¹	2.2 × 10⁻¹	2.2 × 10⁻¹	2.2 × 10⁻¹	0.9999	2.2 × 10⁻¹	2.2 × 10⁻¹	0.999	2.2 × 10⁻¹	2.2 × 10⁻¹	0.999	2.2 × 10⁻¹	0.999	2.2 × 10⁻¹	0.999	0.999
	TSA	2.2 × 10⁻²	0.999	0.999	2.4 × 10⁻²	2.4 × 10⁻²	2.2 × 10⁻²	2.0 × 10⁻²	4.1 × 10⁻²	0.999	0.9999	3.9 × 10⁻²	1.8 × 10⁻²	0.999	2.1 × 10⁻²	2.4 × 10⁻²	2.4 × 10⁻²	0.999	3.3 × 10⁻²	2.5 × 10⁻²	0.999	0.999
	PCO	0.590	6.6 × 10⁻¹	3.5 × 10⁻¹	0.000	5.5 × 10⁻³	1.3 × 10⁻⁵	2.9 × 10⁻¹	7.9 × 10⁻³	1 × 10⁻³	1.2 × 10⁻⁹	0.009	7.4 × 10⁻⁹	1.3 × 10⁻⁸	9.3 × 10⁻⁸	0.004	1.2 × 10⁻¹	1.5 × 10⁻⁶	1.7 × 10⁻²	0.000	2.2 × 10⁻⁶	0.692
	2 SC		0.999	2.2 × 10⁻¹	0.999	2.2 × 10⁻¹	2.2 × 10⁻¹	2.2 × 10⁻¹	2.2 × 10⁻¹	2.2 × 10⁻⁶	0.9999	2.2 × 10⁻¹	2.2 × 10⁻¹	0.999	2.2 × 10⁻¹	2.2 × 10⁻¹	0.999	2.2 × 10⁻¹	0.999	2.2 × 10⁻¹	0.999	0.999
	TSA		2.2 × 10²	2.2 × 10²	0.990	0.990	0.999	0.990	6.7 × 10⁰	2.2 × 10⁻²	2.2 × 10⁻²	0.980	0.990	2.2 × 10⁻²	0.990	0.990	0.990	2.2 × 10⁻²	7.6 × 10⁻¹	0.990	2.2 × 10⁻²	2.2 × 10⁻²
	PCO		4.6 × 10⁻¹	1.4 × 10⁻⁸	0.001	1.9 × 10⁻³	2.3 × 10⁻⁵	7.8 × 10⁻¹	6.1 × 10⁻²	4.1 × 10⁻³	1.6 × 10⁻⁸²	0.066	5.3 × 10⁻⁸	2.3 × 10⁻¹	1.9 × 10⁻⁷	0.001	9.31 × 10⁻²	4.7 × 10⁻⁵	2.8 × 10⁻⁶	0.000	3.1 × 10⁻⁵	0.349
	3 SC			2.2 × 10⁻¹	0.999	2.2 × 10⁻¹	2.2 × 10⁻¹	2.2 × 10⁻¹	2.2 × 10⁻¹	2.2 × 10⁻¹	0.9999	2.2 × 10⁻¹	2.2 × 10⁻¹	0.999	2.2 × 10⁻¹	2.2 × 10⁻¹	0.999	2.2 × 10⁻¹	0.999	2.2 × 10⁻¹	0.999	0.999
	TSA			0.999	2.4 × 10⁻²	2.4 × 10⁻²	2.4 × 10⁻²	2.4 × 10⁻²	4.1 × 10⁻²	0.999	0.9999	3.9 × 10⁻²	1.8 × 10⁻²	0.999	2.1 × 10⁻²	2.3 × 10⁻²	2.4 × 10⁻³	0.999	3.3 × 10⁻²	2.5 × 10⁻²	1.0 × 10+1	0.999
	PCO			8 × 10⁻⁸	2.7 × 10⁻³	5.1 × 10⁻¹	1.1 × 10⁻³	0.008	2.4 × 10⁻¹	1.1 × 10⁻²	6.1 × 10⁻⁷⁶	1.3 × 10⁻²	2.6 × 10⁻⁷	1.1 × 10⁻³	2.0 × 10⁻⁶	0.007	0.002	2.1 × 10	3.1 × 10⁻¹	3.3 × 10⁻³	2.3 × 10⁻⁴	2.5 × 10⁻¹
	4 SC				2.2 × 10⁻¹	0.993	0.997	0.994	0.996	0.783	2.2 × 10⁻¹⁶	0.995	0.991	2.2 × 10⁻²	0.997	0.995	2.2 × 10⁻¹	0.995	2.2 × 10⁻¹	0.999	2.2 × 10⁻¹	2.2 × 10⁻¹
	TSA				2.4 × 10⁻²	2.4 × 10⁻²	2.2 × 10⁻²	2.0 × 10⁻²	4.1 × 10⁻²	0.999	0.9999	3.9 × 10⁻³	1.8 × 10⁻²	0.999	2.1 × 10⁻³	2.3 × 10⁻²	2.4 × 10⁻²	0.999	3.3 × 10⁻²	2.5 × 10⁻²	0.999	0.999
	PCO				1.2 × 10⁻¹	2.3 × 10⁻²	8.4 × 10⁻¹	1.1 × 10⁻⁵	7 × 10⁻⁴	6.7 × 10⁻³	0.4602	1.7 × 10⁻⁴	0.314	8.3 × 10⁻¹	0.088	7.4 × 10⁻⁷	1.4 × 10⁻⁶	0.006	2 × 10⁻³⁶	9.7 × 10⁻²	1.4 × 10⁻⁹	6.1 × 10⁻²
	5 SC					2.2 × 10⁻¹	2.2 × 10⁻¹	2.2 × 10⁻¹	2.2 × 10⁻¹	2.2 × 10⁻¹	0.9999	2.2 × 10⁻⁶	2.2 × 10⁻¹	0.999	2.2 × 10⁻¹	2.2 × 10⁻¹	0.999	2.2 × 10⁻¹	0.999	2.2 × 10⁻¹	0.999	0.999
	TSA					0.990	0.990	0.990	6.8 × 10⁻⁰	2.4 × 10⁻²	2.4 × 10⁻²³	9.8 × 10⁻¹	0.990	2.4 × 10⁻²	0.990	0.990	1.0 × 10	2.4 × 10⁻²	7.5 × 10⁻¹	0.990	2.4 × 10⁻³	2.4 × 10⁻³
	PCO					6.6 × 10⁻⁶	1.1 × 10⁻¹	2.1 × 10⁻⁴	1.1 × 10⁻⁷	4.6 × 10⁻⁷	2.6 × 10⁻¹⁶	0.022	1.6 × 10⁻¹⁵	1.1 × 10⁻¹	9.2 × 10⁻⁴	6.2 × 10⁻¹	1.4 × 10⁻⁴	1.9 × 10⁻¹	1.5 × 10⁻⁶	0.813	7.7 × 10⁻¹	0.000
	6 SC						0.996	0.999	0.996	0.776	2.2 × 10⁻¹⁶	0.998	0.985	2.2 × 10⁻¹	0.996	0.998	2.2 × 10⁻¹	0.998	2.2 × 10⁻¹	0.993	2.2 × 10⁻¹	2.2 × 10⁻⁶
	TSA						0.990	0.990	6.8 × 10⁻¹	2.4 × 10⁻³	2.4 × 10⁻²³	9.8 × 10⁻¹	0.990	2.4 × 10⁻²	0.990	0.990	1.0 × 10	2.4 × 10⁻³	7.6 × 10⁻⁰	0.990	2.4 × 10⁻³	2.4 × 10⁻³
	PCO						0.000	0.000	0.174	0.811	7.3 × 10⁻²⁵	9.5 × 10⁻³	1.3 × 10⁻²	0.000	5.6 × 10⁻²	1.1 × 10⁻²	0.000	3.9 × 10⁻¹	0.244	6.6 × 107	0.000	3.1 × 10⁻⁶
	7 SC							0.997	0.996	0.779	2.2 × 10⁻¹⁶	0.998	0.988	2.2 × 10⁻¹	0.999	0.997	2.2 × 10⁻¹	0.998	2.2 × 10⁻¹	0.997	2.2 × 10⁻¹	2.2 × 10⁻⁶
	TSA							0.990	6.7 × 10⁻¹	2.2 × 10⁻²	2.2 × 10⁻²³	9.8 × 10⁻¹	0.990	2.2 × 10⁻²	0.990	0.990	0.990	2.2 × 10⁻²	7.5 × 10⁻¹	0.990	2.2 × 10⁻²	2.2 × 10⁻³
	PCO							7.9 × 10⁻²	2.2 × 10⁻¹	0.000	7.5 × 10⁻¹¹	1.6 × 10⁻⁹	1.8 × 10⁻¹	0.999	0.000	1.4 × 10⁻⁴	4 × 10⁻²	0.0218	2.3 × 10⁻¹	9.5 × 10⁻²	0.413	7.5 × 10⁻⁰
	8 SC								0.997	0.776	2.2 × 10⁻¹⁶	0.999	0.985	2.2 × 10⁻¹	0.996	0.999	2.2 × 10⁻¹	0.998	2.2 × 10⁻¹	0.9937	2.2 × 10⁻¹	2.2 × 10⁻⁶
	TSA								6.7 × 10⁻¹	2.0⁻²	2.0 × 10⁻²³	9.7 × 10⁻⁰	0.990	2.0 × 10⁻²	0.990	0.990	0.990	2.0 × 10⁻³	7.5 × 10⁻⁰	0.990	2.0 × 10⁻²	2.0 × 10⁻³
	PCO								0.000	0.000	1.8 × 10⁻⁴⁹	6.5 × 10⁻²	9.9 × 10⁻⁶	7.9 × 10⁻²	3.8 × 10⁻⁴	0.000	0.831	1.1 × 10⁻²	0.001	3.6 × 10⁻⁵	4.8 × 10⁻²	1.6 × 10⁻⁶
	9 SC									0.779	2.2 × 10⁻¹⁶	0.998	0.982	2.2 × 10⁻¹	0.999	0.998	2.2 × 10⁻¹	0.998	2.2 × 10⁻¹	0.996	2.2 × 10⁻¹	2.2 × 10⁻⁶
	TSA									4.1 × 10⁻²³	6.9 × 10⁻⁰¹	6.7 × 10⁻⁰¹	4.1 × 10⁻²³	6.7 × 10⁻⁰¹	6.8 × 10⁻⁰¹	6.8 × 10⁻⁰¹	4.1 × 10⁻²	9.1 × 10⁻⁰	6.8 × 10⁻⁰	4.1 × 10⁻³	4.1 × 10⁻²	4.1 × 10⁻³
	PCO									0.261	2 × 10⁻⁴	2.6 × 10⁻⁵	3.5 × 10⁻³	2.2 × 10⁻¹¹	3.4 × 10⁻³	4.9 × 10⁻¹⁷	0.0002	1.4 × 10⁻¹	0.9137	4.3 × 10⁻⁸	3 × 10⁻¹⁴	7.3 × 10⁻³
	10 SC										2.2 × 10⁻¹	0.777	0.790	2.2 × 10⁻¹	0.78	0.777	2.2 × 10⁻¹	0.778	2.2 × 10⁻¹	0.782	2.2 × 10⁻¹	2.2 × 10⁻¹
	TSA										0.999	3.9 × 10⁻³	1.8 × 10⁻³	0.999	2.1 × 10⁻²	2.3 × 10⁻²	2.5 × 10⁻²	0.999	3.3 × 10⁻²	2.5 × 10⁻³	0.999	0.999
	PCO										3.1 × 10⁻²	1 × 10⁻⁵	1 × 10⁻²⁴	0.001	9.1 × 10⁻²	2.5 × 10⁻²	0.002	5.6 × 10⁻¹	0.348	2.8 × 10⁻⁷	0.000	4.5 × 10⁻⁶
	11 SC											2.2 × 10⁻¹	2.2 × 10⁻⁶	0.999	2.2 × 10⁻¹	2.2 × 10⁻²	0.999	2.2 × 10⁻¹	0.999	2.2 × 10⁻⁶	0.999	0.999
	TSA											0.999	3.9 × 10⁻³	1.8 × 10⁻³	0.999	2.1 × 10⁻²	2.3 × 10⁻²	2.4 × 10⁻²	0.999	3.3 × 10⁻³	2.5 × 10⁻³	0.999
	PCO											2.6 × 10⁻¹	0.767	7.5 × 10⁻¹	0.286	1.4 × 10⁻⁷	8.5 × 10⁻⁶	0.004	1.7 × 10⁻³	3.9 × 10⁻⁸	8.8 × 10⁻¹	8.5 × 10⁻²
	12 SC												0.986	2.2 × 10⁻⁶	0.997	0.999	2.2 × 10⁻¹	0.999	2.2 × 10⁻¹	0.996	2.2 × 10⁻⁶	2.2 × 10⁻¹
	TSA												9.7 × 10⁻¹	3.9 × 10⁻³	9.79 × 10⁻¹	9.8 × 10⁻¹	9.8 × 10⁻⁰	3.9 × 10⁻³	7.7 × 10⁻⁰	0.980	3.9 × 10⁻³	3.9 × 10⁻²
	PCO												2.9 × 10⁻¹	1.6 × 10⁻¹	2.1 × 10⁻¹	0.000	1.3 × 10⁻³	9.4 × 10⁻⁸	1.3 × 10⁻⁴	0.008	4.9 × 10⁻¹	0.001
	13 SC													2.2 × 10⁻¹	0.988	0.985	2.2 × 10⁻¹	0.986	2.2 × 10⁻¹	0.998	2.2 × 10⁻¹	2.2 × 10⁻¹
	TSA													1.8 × 10⁻²	0.990	0.990	0.990	1.8 × 10⁻²	7.5 × 10⁻⁰	0.990	1.8 × 10⁻³	1.80 × 10⁻³
	PCO													1.8 × 10⁻¹	0.451	1.3 × 1068	1.3 × 10⁻⁵	0.010	2 × 10⁻³	9.4 × 10⁻²	0.000	5.9 × 10⁻¹
	14 SC														2.2 × 10⁻⁶	2.2 × 10⁻¹	0.999	2.2 × 10⁻¹	0.999	2.2 × 10⁻¹	0.999	0.999
	TSA														2.1 × 10⁻³	2.8 × 10⁻³	2.4 × 10⁻³	0.999	3.3 × 10⁻²	2.5 × 10⁻²	0.999	0.999
	PCO														0.000	1.4 × 10⁻³	4 × 10⁻³	0.021	2.3 × 10⁻¹	9.6 × 10⁻¹	0.413	7.5 × 10⁻¹
	15 SC															0.997	2.2 × 10⁻⁶	0.997	2.2 × 10⁻¹	0.997	2.2 × 10⁻¹	2.2 × 10⁻¹
	TSA															0.990	0.990	2.1 × 10⁻³	7.5 × 10⁻⁰	0.990	2.1 × 10⁻²	2.1 × 10⁻²
	PCO															1.6 × 10⁻⁶	2.2 × 10⁻⁵	0.058	8.8 × 10⁻²	2.8 × 10⁻²	0.000	1.8 × 10⁻¹
	16 SC																2.2 × 10⁻¹	0.999	2.2 × 10⁻¹	0.994	2.2 × 10⁻¹	2.2 × 10⁻¹
	TSA																0.990	2.3 × 10⁻³	7.5 × 10⁻⁰	0.990	2.3 × 10⁻²	2.3 × 10⁻²
	PCO																0.000	2.7 × 10⁻⁴	8.8 × 10⁻¹	1.1 × 10⁻²	1.5 × 10⁻⁴	0.007
	17 SC																	2.2 × 10⁻¹	0.999	2.2 × 10⁻¹	0.999	0.999
	TSA																	2.4 × 10⁻²	7.5 × 10⁻⁰	0.990	2.4 × 10⁻²	2.4 × 10⁻²
	PCO																	3.4 × 10⁻²	0.005	8.6 × 10⁻⁴	4.2 × 10⁻²	1.2 × 10⁻²
	18 SC																		2.2 × 10⁻¹	0.995	2.2 × 10⁻²	2.2 × 10⁻¹
	TSA																		3.3 × 10⁻²	2.5 × 10⁻³	0.999	0.999
	PCO																		1 × 10⁻¹	6.2 × 10⁻¹¹	0.009	9.1 × 10⁻⁷
	19 SC																			2.2 × 10⁻¹	0.999	0.999
	TSA																			7.6 × 10⁻⁰	3.3 × 10⁻²	3.3 × 10⁻²
	PCO																			2 × 10⁻⁷²	1.2 × 10⁻¹	2.2 × 10⁻²
	20 SC																				2.2 × 10⁻¹	2.2 × 10⁻¹⁶
	TSA																				2.5 × 10⁻³	2.5 × 10⁻²
	PCO																				6.3 × 10⁻¹	1.6 × 10⁻⁰
	21 SC																					0.999
	TSA																					0.999
	PCO																					0.999
PDC with 5 regions	1 SC	2.5 × 10⁻¹	2.5 × 10⁻¹	2.5 × 10⁻¹	0.999
	TSA	7.5 × 10⁻²	3.0 × 10⁻²	0.999	1.5 × 10⁻²
	PCO	2.1 × 10⁻⁰	2.2 × 10	4.1 × 10⁻⁶	0.000
	2 SC		0.615	0.874	2.2 × 10⁻¹
	TSA		3.8 × 10⁻¹	7.5 × 10⁻²	0.970
	PCO		7.3 × 10⁻⁴	2.2 × 10⁻¹	1.1 × 10⁻⁴
	3 SC			0.5083	2.2 × 10⁻¹
	TSA			7.5 × 10⁻²	4.4 × 10⁻¹
	PCO			2.2 × 10⁻¹	0.994
	4 SC				2.2 × 10⁻¹
	TSA				1.5 × 10⁻²
	PCO				1.5 × 10⁻³

References

Lawton, J.H.; May, R.M. (Eds.) Extinction Rates; Oxford University Press: Oxford, UK, 1995; pp. xii + 233. ISBN 0-19-854829. [Google Scholar] [CrossRef]
Butchart, S.H.M.; Walpole, M.; Collen, B.; Van Strien, A.; Scharlemann, J.P.W.; Almond, R.E.A.; Baillie, J.E.M.; Bomhard, B.; Brown, C.; Bruno, J.; et al. Global biodiversity: Indicators of recent declines. Science 2010, 328, 1164–1168. [Google Scholar] [CrossRef]
Intergovernmental Science-Policy Platform on Biodiversity and Ecosys-tem Services (IPBES). Summary for Policymakers of the Global Assessment Report on Biodiversity and Ecosystem Services of the Intergovernmental Science-Policy Platform on Biodiversity and Ecosystem Services; IPBES Secretariat: Bonn, Germany, 2019; pp. 22–47. [Google Scholar]
Narain, D.; Sonter, L.; Lechner, A.; Watson, J.; Simmonds, J.; Maron, M. Global assessment of the biodiversity safeguards of development banks that finance infrastructure. Conserv. Biol. 2023, 37, e14095. [Google Scholar] [CrossRef]
Dinerstein, E.; Olson, D.; Joshi, A.; Vynne, C.; Burgess, N.D.; Wikramanayake, E.; Hahn, N.; Palminteri, S.; Hedao, P.; Noss, R.; et al. Al-Shammari, Muhammad Saleem, An Ecoregion-Based Approach to Protecting Half the Terrestrial Realm. BioScience 2017, 67, 534–545. [Google Scholar] [CrossRef] [PubMed]
Franklin, J. Species distribution modelling supports the study of past, present and future biogeographies. J. Biogeogr. 2023, 50, 1533–1548. [Google Scholar] [CrossRef]
Whittaker, R.J.; Araújo, M.B.; Jepson, P.; Ladle, R.J.; Watson, J.E.M.; Willis, K.J. Conservation biogeography: Assessment and prospect. Divers. Distrib. 2005, 11, 3–23. [Google Scholar] [CrossRef]
Hortal, J.; Lobo, J.M.; Martin-PIERA, F. Una estrategia para obtener regionalizaciones bióticas fiables a partir de datos incompletos: El caso de los escarabeidos (coleoptera, scarabaeinae) ibérico-baleares. Graellsia 2003, 59, 331–344. [Google Scholar] [CrossRef]
Fisher, B.L. Improving inventory efficiency: A case study of leaf-litter ant diversity in madagascar. Ecol. Appl. 1999, 9, 714–731. [Google Scholar] [CrossRef]
Joaquín, C.; Magnus, N.; Alexis, R.; Anton, E.; Martin, R. Regularities in species’ niches reveal the world’s climate regions. eLife 2021, 10, e58397. [Google Scholar] [CrossRef]
Gonzalez-Orozco, C.E.; Laffan, S.W.; Knerr, N.; Miller, J.T. A biogeographical regionalization of Australian Acacia species. J. Biogeogr. 2013, 40, 2156–2166. [Google Scholar] [CrossRef]
Bloomfield, N.J.; Knerr, N.; Encinas-Viso, F. A comparison of network and clustering methods to detect biogeographical regions. Ecography 2018, 41, 1–10. [Google Scholar] [CrossRef]
Wallace, A.R. The Geographical Distribution of Animals; Harper and Brothers: New York, NY, USA, 1876; Volumes I & II. [Google Scholar]
Ojeda, F.; Marañón, T.; Arroyo, J. Patterns of ecological, chorological and taxonomic diversity at both sides of the Strait of Gibraltar. J. Veg. Sci. 1996, 7, 63–72. [Google Scholar] [CrossRef]
Gao, P.; Kupfer, J.A. Capitalizing on a Wealth of Spatial Information: Improving Biogeographic Regionalization Through the Use of Spatial Clustering. Appl. Geogr. 2018, 99, 98–108. [Google Scholar] [CrossRef]
Ariza-Salamanca, A.J.; González-Moreno, P.; López-Quintanilla, J.B.; Navarro-Cerrillo, R.M. Large-Scale Mapping of Complex Forest Typologies Using Multispectral Imagery and Low-Density Airborne LiDAR: A Case Study in Pinsapo Fir Forests. Remote Sens. 2024, 16, 3182. [Google Scholar] [CrossRef]
Gonzalez-Orozco, C. Biogeographical regionalization of Colombia: A revised area taxonomy. Phytotaxa 2021, 484, 247–260. [Google Scholar] [CrossRef]
Holdridge, L.R. Life Zone Ecology; Tropical Science Center: San José, Costa Rica, 1967; p. 206. [Google Scholar]
Sun, Y.; Niu, J. Regionalization of Daily Soil Moisture Dynamics Using Wavelet-Based Multiscale Entropy and Principal Component Analysis. Entropy 2019, 21, 548. [Google Scholar] [CrossRef]
Khan, A.J.; Koch, M. Correction and Informed Regionalization of Precipitation Data in a High Mountainous Region (Upper Indus Basin) and Its Effect on SWAT-Modelled Discharge. Water 2018, 10, 1557. [Google Scholar] [CrossRef]
Hargrove, W.; Hoffman, F. Potential of Multivariate Quantitative Methods for Delineation and Visualization of Ecoregions. Environ. Manag. 2004, 34 (Suppl. S1), S39–S60. [Google Scholar] [CrossRef]
Rousseau, J.; Betts, M. Factors influencing transferability in species distribution models. Ecography 2022, 2022, e06060. [Google Scholar] [CrossRef]
Laffan, S.; Lubarsky, E.; Rosauer, D. Biodiverse: A tool for the spatial analysis of biological and other diversity. Ecography 2010, 33, 643–647. [Google Scholar] [CrossRef]
Daru, B.H.; Karunarathne, P.; Schliep, K. phyloregion: R package for biogeographic regionalization and macroecology. Methods Ecol. Evol. 2020, 11, 1483–1491. [Google Scholar] [CrossRef]
Lemus-Canovas, M.; Lopez-Bustins, J.A.; Martin-Vide, J.; Royé, D. synoptReg: An R package for computing a synoptic climate classification and a spatial regionalization of environmental data. Environ. Model. Softw. 2019, 118, 114–119. [Google Scholar] [CrossRef]
Zhao, W.; Ma, J.; Liu, Q.; Song, J.; Tysklind, M.; Liu, C.; Wang, D.; Qu, Y.; Wu, Y.; Wu, F. Comparison and application of SOFM, fuzzy c-means and k-means clustering algorithms for natural soil environment regionalization in China. Environ. Res. 2023, 216, 114519. [Google Scholar] [CrossRef] [PubMed]
Wei, X.; Zhang, Z.; Huang, H.; Zhou, Y. An overview on deep clustering. Neurocomputing 2024, 590, 127761. [Google Scholar] [CrossRef]
Aydin, O.; Janikas, M.; Assuncao, R.; Lee, T.-H. A quantitative comparison of regionalization methods. Int. J. Geogr. Inf. Sci. 2021, 35, 2287–2315. [Google Scholar] [CrossRef]
Chen, X.; Zhang, C.; Chen, X.; Saunier, N.; Sun, L. Discovering Dynamic Patterns from Spatiotemporal Data with Time-Varying Low-Rank Autoregression. arXiv 2022, arXiv:2211.15482v1. [Google Scholar] [CrossRef]
Del Barrio, G.; Sanjuán, M.E.; Martínez-Valderrama, J.; Ruiz, A. Descripción y Ensayo de un Procedimiento de Regionalización Climática del Territorio; Serie “Metodologías para el seguimiento del estado de conservación de los tipos de hábitat”; Ministerio Para la Transición Ecológica: Madrid, Spain, 2019; p. 42. [Google Scholar]
Pata, P.; Galbraith, M.; Young, K.; Margolin, A.; Perry, R.; Hunt, B. Data-driven determination of zooplankton bioregions and robustness analysis. MethodsX 2024, 12, 102676. [Google Scholar] [CrossRef]
Grassi, K.; Poisson-Caillault, É.; Bigand, A.; Lefebvre, A. Comparative Study of Clustering Approaches Applied to Spatial or Temporal Pattern Discovery. J. Mar. Sci. Eng. 2020, 8, 713. [Google Scholar] [CrossRef]
Pampuch, L.; Negri, R.; Loikith, P.; Bortolozo, C. A Review on Clustering Methods for Climatology Analysis and Its Application over South America. Int. J. Geosci. 2023, 14, 877–894. [Google Scholar] [CrossRef]
Hartigan, J.A.; Wong, M.A. A K-Means Clustering Algorithm. J. R. Stat. Soc. Ser. C Appl. Stat. 1979, 28, 100–108. [Google Scholar] [CrossRef]
Ester, M.; Kriegel, H.P.; Sander, J.; Xu, X. A density-based algorithm for discovering clusters in large spatial databases with noise. In Proceedings of the Knowledge Discovery and Data Mining KDD, Portland, OR, USA, 2–4 August 1996; Volume 96, pp. 226–231. [Google Scholar]
Tortora, C.; Summa, M.G.; Palumbo, F. Factor PD-Clustering. arXiv 2011, arXiv:1106.3830v1. [Google Scholar] [CrossRef]
Połap, D.; Prokop, K.; Srivastava, G. Federated Heuristic Optimization Based on Fuzzy Clustering and Red Fox Optimization Algorithm. In Proceedings of the 2023 IEEE International Conference on Fuzzy Systems (FUZZ), Incheon, Republic of Korea, 13–17 August 2023; pp. 1–6. [Google Scholar] [CrossRef]
Rosvall, M.; Bergstrom, C. Maps of Random Walks on Complex Networks Reveal Community Structure. Proc. Natl. Acad. Sci. USA 2008, 105, 1118–1123. [Google Scholar] [CrossRef] [PubMed]
Legendre, P.; y Legendre, L. Numerical Ecology, 2nd ed.; Elsevier: Amsterdam, The Netherlands, 1998. [Google Scholar]
Paula-Souza, L.B.D.; Diniz Filho, J.A.F. Variance partitioning and spatial eigenvector analyses with large macroecological datasets. Front. Biogeogr. 2020, 12. [Google Scholar] [CrossRef]
Kreft, H.; Jetz, W. A framework for delineating biogeographical regions based on species distributions. J. Biogeogr. 2010, 37, 2029–2053. [Google Scholar] [CrossRef]
Fick, S.E.; Hijmans, R.J. WorldClim 2: New 1km spatial resolution climate surfaces for global land areas. Int. J. Climatol. 2017, 37, 4302–4315. [Google Scholar] [CrossRef]
ISRIC. SoilGrids: Global Gridded Soil Information. 2023. Available online: https://files.isric.org/soilgrids (accessed on 17 April 2024).
Friedman, J. Greedy Function Approximation: A Gradient Boosting Machine. Ann. Stat. 2001, 29, 1189–1232. [Google Scholar] [CrossRef]
Ahmed, M.; Seraj, R.; Islam, S.M.S. The k-means Algorithm: A Comprehensive Survey and Performance Evaluation. Electronics 2020, 9, 1295. [Google Scholar] [CrossRef]
Arthur, D.; Vassilvitskii, S. K-Means++: The Advantages of Careful Seeding. In Proceedings of the Annual ACM-SIAM Symposium on Discrete Algorithms; Society for Industrial and Applied Mathematics: New Orleans LA, USA, 2007; Volume 8, pp. 1027–1035, ISBN 9780898716245. [Google Scholar]
Starczewski, A.; Cader, A. Determining the Eps Parameter of the DBSCAN Algorithm. In Artificial Intelligence and Soft Computing, Proceedings of the 18th International Conference, ICAISC 2019; Zakopane, Poland, 16–20 June 2019, Proceedings, Part II 18; Springer International Publishing: Berlin/Heidelberg, Germany, 2019; pp. 420–430. [Google Scholar] [CrossRef]
Ferraro, M.B.; Giordani, P.; Serafini, A. fclust: An R Package for Fuzzy Clustering. R J. 2019, 11, 1–18. Available online: https://journal.r-project.org/archive/2019/RJ-2019-017/RJ-2019-017.pdf (accessed on 4 November 2024). [CrossRef]
Scrucca, L. GA: A Package for Genetic Algorithms in R. J. Stat. Softw. 2013, 53, 1–37. [Google Scholar] [CrossRef]
Bedia, J.; Herrera, S.; Guti_errez, J.M. Dangers of using global bioclimatic datasets for ecological niche modeling. limitations for future climate projections. Glob. Planet. Chang. 2013, 107, 1–12. [Google Scholar] [CrossRef]
cariouCsárdi, G.; Nepusz, T.; Traag, V.; Horvát Sz Zanini, F.; Noom, D.; Müller, K. igraph: Network Analysis and Visualization in R. R Package Version2.1.1. 2024. Available online: https://CRAN.R-project.org/package=igraph (accessed on 6 November 2024). [CrossRef]
Cariou, C.; Le Moan, S.; Chehdi, K. Improving K-Nearest Neighbor Approaches for Density-Based Pixel Clustering in Hyperspectral Remote Sensing Images. Remote Sens. 2020, 12, 3745. [Google Scholar] [CrossRef]
Dray, S.; Legendre, P.; Peres-Neto, P.R. Spatial modelling: A comprehensive framework for principal coordinate analysis of neighbour matrices (PCNM). Ecol. Model. 2006, 196, 483–493. [Google Scholar] [CrossRef]
Frau, C.; Pino, L.; Rojas, Y.; Hernández, Y. Generalización de modelo digital de elevación condicionada por puntos críticos de terreno. Bol. Cienc. Geod. 2011, 17, 439–457. [Google Scholar] [CrossRef]
QGIS Geographic Information System. Open Source Geospatial Foundation Project. Available online: https://qgis.org (accessed on 10 November 2024).
Cohen, J. A coefficient of agreement for nominal scales. Educ. Psychol. Meas. 1960, 20, 37–46. [Google Scholar] [CrossRef]
Sim, J.; Wright, C.C. The kappa statistic in reliability studies: Use, interpretation, and sample size requirements. Phys. Ther. 2005, 85, 257–268. [Google Scholar] [CrossRef]
Foody, G. Explaining the unsuitability of the kappa coefficient in the assessment and comparison of the accuracy of thematic maps obtained by image classification. Remote Sens. Environ. 2020, 239, 111630. [Google Scholar] [CrossRef]
Ridgeway, G. Gbm: Generalized Boosted Regression Models. Compute 1. 1-12. 2005. Available online: https://cran.r-project.org/web/packages/gbm/gbm.pdf (accessed on 11 November 2024).
Smith, J.R.; Hendershot, J.N.; Nova, N.; Daily, G.C. The biogeography of ecoregions: Descriptive power across regions and taxa. J. Biogeogr. 2020, 47, 1413–1426. [Google Scholar] [CrossRef]
Warton, D.; Foster, S.; De’ath, G.; Stoklosa, J.; Dunstan, P. Model-based thinking for community ecology. Plant Ecol. 2015, 216, 669–682. [Google Scholar] [CrossRef]
Instituto Nacional Geográfico. España en mapas. In Una Síntesis Geográfica; Serie Compendios del Atlas Nacional de España (ANE); Centro Nacional de Información Geográfica: Madrid, Spain, 2019; p. 620. [Google Scholar] [CrossRef]
Díaz González, T.E.; Penas, Á. The High Mountain Area of Northwestern Spain: The Cantabrian Range, the Galician-Leonese Mountains and the Bierzo Trench. In The Vegetation of the Iberian Peninsula. Plant and Vegetation; Loidi, J., Ed.; Springer: Cham, Switzerland, 2017; Volume 12. [Google Scholar] [CrossRef]
Vaccaro, I.; Beltran, O. (Eds.) Social and Ecological History of the Pyrenees: State, Market, and Landscape (New Frontiers in Historical Ecology); Left Coast Press: Walnut Creek, CA, USA, 2010. [Google Scholar]
Janík, T.; Romportl, D. Comparative landscape typology of the Bohemian and Bavarian Forest National Parks. Eur. J. Environ. Sci. 2016, 6, 114–118. [Google Scholar] [CrossRef]
WWF. Iberian Sclerophyllous and Semi-Deciduous Forests. 2019. Available online: https://www.worldwildlife.org/ecoregions/pa1209 (accessed on 17 April 2024).
WWF. Iberian Conifer Forests. 2019. Available online: https://www.worldwildlife.org/ecoregions/pa1208 (accessed on 17 April 2024).
WWF. Western Europe and Northern Africa: Parts of Portugal, Spain, France, Italy, and Morocco. 2019. Available online: https://www.worldwildlife.org/ecoregions/pa1221 (accessed on 20 May 2024).
Mucina, L.; Bueltmann, H.; Dierßen, K.; Theurillat, J.-P.; Raus, T.; Čarni, A.; Šumberová, K.; Willner, W.; Dengler, J.; Gavilán, R.; et al. Vegetation of Europe: Hierarchical floristic classification system of vascular plant, bryophyte, lichen, and algal communities. Appl. Veg. Sci. 2016, 19, 3–264. [Google Scholar] [CrossRef]
Joint Research Centre of the European Commission. The Digital Observatory for Protected Areas (DOPA) Explorer 3.1: North Saharan Steppe and Woodlands. 2019. Available online: https://dopa-explorer.jrc.ec.europa.eu/ecoregion/81321 (accessed on 23 April 2024).
Boubker, J. La Certification des Agrumes au Maroc. CIHEAM-IAMB 21, Options. Méditerranéennes; Série B/021-Proceedings of the Mediterranean Network on Certification of Citrus; CIHEAM: Bari, Italy, 2004. [Google Scholar]
Finlayson, G.; Finlayson, C.; Espejo, J.M.R. Dynamics of a thermo-Mediterranean coastal environment—The Coto Doñana National Park. Quat. Sci. Rev. 2008, 27, 2145–2152. [Google Scholar] [CrossRef]
Oro, D.; Martínez-Vilalta, A. Migration and dispersal of Audouin’s Gull Lams audouinii from the Ebro Delta. Ostrich 1994, 65, 225–230. [Google Scholar] [CrossRef]
González-Solís, J.; Bernadí, X.; Ruiz, X. Seasonal Variation of Waterbird Prey in the Ebro Delta Rice Fields. Colon. Waterbirds 1996, 19, 135–142. [Google Scholar] [CrossRef]
Schenker, A.; Cahenzli, F.; Gutbrod, K.; Thévenot, M.; Erhardt, A. The Northern Bald Ibis Geronticus eremita in Morocco since 1900: Analysis of ecological requirements. Bird Conserv. Int. 2020, 30, 117–138. [Google Scholar] [CrossRef]

Figure 1. Example of a K-NN graph with different eps values.

Figure 3. Within-group sum of squares (lower line) and between-group sum of squares (upper line), used to determine the number of clusters with test data.

Figure 4. Regions obtained without spatial contiguity: (a) 21 regions resulting from DBSCAN; (b) results from K-means++ with k = 5; (c) results from K-means++ with k = 22; (d) results from PDC with k = 5.

Figure 5. Regionalization obtained with Infomap: (a) Result without contiguity with k = 5; (b) Result without contiguity with k = 3.

Figure 6. Regionalization including spatial contiguity for K-means++: (a) contiguity as principal coordinates (PCO) with k = 5; (b) contiguity as a third-degree polynomial (TSA) with k = 5; (c) contiguity as principal coordinates (PCO) with k = 22; (d) contiguity as a third-degree polynomial (TSA) with k = 22.

Figure 7. Regionalization obtained with Infomap: (a) contiguity as a third-degree polynomial (TSA) with k = 5; (b) contiguity as principal coordinates (PCO) with k = 3; (b1) contiguity as principal coordinates (PCO) with k = 5.

Figure 8. Regions after cluster regrouping obtained with the PDC algorithm.

Figure 9. Regionalization obtained with PDC: (a) contiguity as principal coordinates (PCO) with k = 5; (b) contiguity as principal coordinates (PCO) with k = 22.

Figure 10. Distribution of the silhouette score for the algorithm configurations with the most significant kappa index with respect to the reference regionalization. Federate: fuzzy federated algorithm; Infomap: k = 5 with contiguity as a third-degree polynomial; K-means++: k = 5 with contiguity as a third-degree polynomial; PDC: k = 3 with contiguity as a third-degree polynomial.

Figure 11. Result of the federated heuristic optimization based on fuzzy clustering using the centroids of the K-means++ TSA model and k = 5 as centers for fuzzy clustering.
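
The silhouette score summarized in Figure 10 can be illustrated with a short sketch. This is a generic, textbook computation of per-point silhouette widths with Euclidean distance, not the authors' implementation; the function name `silhouette_values` and the toy points are hypothetical.

```python
import math

def silhouette_values(points, labels):
    """Per-point silhouette widths s(i) = (b - a) / max(a, b).

    a: mean distance from point i to the other members of its own cluster.
    b: smallest mean distance from point i to the members of any other cluster.
    Euclidean distance is an assumption made for this sketch.
    """
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("need at least two clusters")

    values = []
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:
            values.append(0.0)  # common convention for singleton clusters
            continue
        a = sum(math.dist(points[i], points[j]) for j in own) / len(own)
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        values.append((b - a) / denom if denom > 0 else 0.0)
    return values

# Hypothetical toy data: two well-separated groups of two points each.
toy_points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
toy_labels = [0, 0, 1, 1]
toy_scores = silhouette_values(toy_points, toy_labels)
```

Values close to 1 indicate cells that sit well inside their region, values near 0 indicate cells on a boundary, and negative values suggest a cell that would fit a neighbouring region better.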

Table 1. Codes and description of the variables used.

Code	Type	Description
BIO1	Climatic	Annual Mean Temperature
BIO2	Climatic	Mean Diurnal Range (Mean of monthly (max temp–min temp))
BIO3	Climatic	Isothermality (BIO2/BIO7) (×100)
BIO4	Climatic	Temperature Seasonality (standard deviation × 100)
BIO5	Climatic	Max Temperature of Warmest Month
BIO6	Climatic	Min Temperature of Coldest Month
BIO7	Climatic	Temperature Annual Range (BIO5-BIO6)
BIO8	Climatic	Mean Temperature of Wettest Quarter
BIO9	Climatic	Mean Temperature of Driest Quarter
BIO10	Climatic	Mean Temperature of Warmest Quarter
BIO11	Climatic	Mean Temperature of Coldest Quarter
BIO12	Climatic	Annual Precipitation
BIO13	Climatic	Precipitation of Wettest Month
BIO14	Climatic	Precipitation of Driest Month
BIO15	Climatic	Precipitation Seasonality (Coefficient of Variation)
BIO16	Climatic	Precipitation of Wettest Quarter
BIO17	Climatic	Precipitation of Driest Quarter
BIO18	Climatic	Precipitation of Warmest Quarter
BIO19	Climatic	Precipitation of Coldest Quarter
Lat	Spatial	Geographic coordinates (Latitude)
Lon	Spatial	Geographic coordinates (Longitude)
Evapo	Climatic	Potential evapotranspiration
Ph	Edaphic	pH of soil
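
Two of the variables in Table 1 are defined from others: BIO7 is BIO5 minus BIO6, and BIO3 is BIO2 divided by BIO7, times 100. A minimal sketch of that arithmetic follows; the function names and the input values are hypothetical and only illustrate the definitions as given in the table.

```python
def temperature_annual_range(bio5, bio6):
    """BIO7: max temperature of warmest month minus min temperature of coldest month."""
    return bio5 - bio6

def isothermality(bio2, bio7):
    """BIO3: mean diurnal range as a percentage of the annual temperature range."""
    if bio7 == 0:
        raise ValueError("BIO7 must be non-zero")
    return 100.0 * bio2 / bio7

# Hypothetical cell: warmest-month max 30, coldest-month min 5, mean diurnal range 10.
example_bio7 = temperature_annual_range(30.0, 5.0)
example_bio3 = isothermality(10.0, example_bio7)
```

Because BIO3 and BIO7 are functions of BIO2, BIO5, and BIO6, these variables are correlated by construction, which is worth keeping in mind when reading variable-importance rankings.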

Table 2. The number of regions based on the percentage of closest nodes selected to create a community. SC: Contiguity not included; TSA: Contiguity included as third-degree polynomial; PCO: Contiguity included as principal coordinates.

Percentage of Closest Nodes	SC	TSA	PCO
10	14	11	13
15	7	6	7
20	6	5	6
25	6	4	5
30	5	3	4
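
TSA in Table 2 denotes spatial contiguity expressed as a third-degree polynomial of the coordinates. The sketch below lists the nine monomials that a full cubic polynomial in two coordinates contains. It assumes that Lon and Lat from Table 1 play the roles of x and y; centering, scaling, and the way these terms enter each clustering algorithm are not covered here. The function name `cubic_trend_terms` is hypothetical.

```python
def cubic_trend_terms(x, y):
    """Return the nine monomials of a full third-degree polynomial in (x, y).

    Order: x, y, x^2, x*y, y^2, x^3, x^2*y, x*y^2, y^3.
    The constant term is omitted because it carries no spatial information.
    """
    terms = []
    for degree in range(1, 4):
        for i in range(degree, -1, -1):
            j = degree - i
            terms.append((x ** i) * (y ** j))
    return terms

# Hypothetical coordinates, chosen only to make the terms easy to check by hand.
example_terms = cubic_trend_terms(2.0, 3.0)
```

Appending such terms to the environmental variables gives each cell a smooth description of its position, so cells that are close in space also tend to be close in the clustering space.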

Table 3. Kappa index of the overlapping regions obtained with the different algorithms (Algorithm: Algorithms applied in the study, REG.OLSON: Reference regionalization used in the study; k: number of regions to compare; SC: “without contiguity”; TSA: “spatial contiguity as a third-degree polynomial”; PCO: “spatial contiguity as eigenvectors”).

	Algorithm	K-Means++ (SC)	PDC (SC) *	K-Means++ (TSA)	K-Means++ (PCO)	PDC (TSA) *	PDC (PCO)	INFOMAP (SC) *	INFOMAP (TSA) *	INFOMAP (PCO) *	FEDERATE MODEL	REG. OLSON
Algorithm	K	5/22	3	5/22	5/22	3	5/22	3	5	3	7	5/22
K-means++ (SC)	5/22	1/1	0.3218	0.1929/0.1193	0.0703/−0.032	0.3205	0.3042/0.0118	0.1101	−0.1846	−0.1370	0.0154	0.1055/0.0573
PDC (SC) *	3	0.3218	1	0.4159	−0.0356	0.2478	0.2518	0.0133	0.0479	−0.1817	0.0582	0.2416
K-means++ (TSA)	5/22	0.1929/0.1193	0.4159	1	−0.2196/−0.0107	0.3205	−0.0053/0.0216	0.0205	0.3135	−0.2605	0.356	0.2076/0.048
K-means++ (PCO)	5/22	0.0703/−0.032	−0.0356	−0.2196/−0.0107	1	−0.1647	−0.1537/−0.0148	0.0058	0.2823	0.0561	0.0274	0.0483/−0.0234
PDC (TSA) *	3	0.0243	0.2478	0.3205	−0.1647	1	−0.0796	0.1349	−0.4020	−0.1907	0.1201	0.2427
PDC (PCO)	5/22	0.3042/0.0118	0.2518	−0.0053/0.0216	−0.1537/−0.0148	−0.0796	1	0.0352	−0.068	−0.0354	−0.0495	−0.153/0.063
INFOMAP (SC) *	3	0.1101	0.0133	0.0205	0.0058	0.1349	0.0352	1	−0.0087	−0.1097	−0.0012	0.0364
INFOMAP (TSA) *	5	−0.1846	0.0479	0.3135	0.2823	−0.4020	−0.068	−0.0087	1	0.1749	−0.086	0.1311
INFOMAP (PCO) *	3	−0.1370	−0.1817	−0.2605	0.0561	−0.1907	−0.0354	−0.1097	0.1749	1	0.0256	−0.1286
FEDERATE MODEL	7	0.0154	0.0582	0.356	0.0274	0.1201	−0.0495	−0.0012	−0.086	0.0256	1	0.069/0.003
DBSCAN (SC)	5/22	0.2049/−0.003	0.1613	0.1697/−0.002	−0.1375/−0.002	0.4795	0.0410/0.0411	0.3197	−0.1594	−0.1955	0.0257	−0.0002/0.0024
DBSCAN (TSA)	5/22	0.2096/−0.003	0.1639	0.1795/−0.002	−0.0139/−0.0023	0.4858	0.040/0.0388	0.3262	−0.1555	−0.1940	0.0225	−0.0002/0.0030
DBSCAN (PCO)	5/22	0.099/0.0126	0.0459	0.084/0.0401	−0.0493/−0.0165	0.0681	0.0131/0.0185	0.1137	−0.0093	−0.0942	−0.048	−0.007/0.0052

* In these cases, the kappa index is calculated only for k < 22.
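
The kappa index reported in Table 3 measures agreement between two label maps beyond what chance alone would produce. The following is a minimal sketch of Cohen's kappa for two label sequences over the same cells. It assumes the region labels of the two maps have already been matched to a common coding, since cluster labels are otherwise arbitrary; how the overlapping regions were matched in the study is not reproduced here. The function name `cohen_kappa` is hypothetical.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa: (p_o - p_e) / (1 - p_e) for two equal-length label sequences."""
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: share of cells given the same label by both maps.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement, computed from the marginal label frequencies of each map.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both maps assign one identical label to every cell
    return (p_o - p_e) / (1.0 - p_e)

# Hypothetical toy maps over four cells.
kappa_same = cohen_kappa([1, 1, 2, 2], [1, 1, 2, 2])
kappa_chance = cohen_kappa([1, 1, 2, 2], [1, 2, 1, 2])
kappa_opposite = cohen_kappa([1, 1, 2, 2], [2, 2, 1, 1])
```

A value of 1 means complete agreement, 0 means agreement no better than chance, and negative values, which appear in several cells of Table 3, mean agreement below the chance level.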

Table 4. Relative importance of the variables, obtained by applying gradient boosting to the model with the highest kappa index (K-means++ (TSA) with 5 regions).

Feature	Relative Importance
Bio1—Annual Mean Temperature	51.71885
Bio11—Mean Temperature of Coldest Quarter	17.91296
Bio15—Precipitation Seasonality (Coefficient of Variation)	10.67240
Bio14—Precipitation of Driest Month	4.64482
Bio12—Annual Precipitation	2.75624
Bio18—Precipitation of Warmest Quarter	2.72849
Bio4—Temperature Seasonality (standard deviation ×100)	1.97000
Evapo—Potential evapotranspiration	1.57400
Bio19—Precipitation of Coldest Quarter	1.56720
Bio17—Precipitation of Driest Quarter	1.26907
Bio13—Precipitation of Wettest Month	1.08024
Bio3—Isothermality (BIO2/BIO7) (×100)	0.73156
Bio6—Min Temperature of Coldest Month	0.35595
Bio16—Precipitation of Wettest Quarter	0.22083
Ph—pH of soil	0.17214
Bio10—Mean Temperature of Warmest Quarter	0.16941
Bio2—Mean Diurnal Range (Mean of monthly (max temp—min temp))	0.11501
Bio8—Mean Temperature of Wettest Quarter	0.11230
Bio9—Mean Temperature of Driest Quarter	0.10589
Bio7—Temperature Annual Range (BIO5-BIO6)	0.08240
Bio5—Max Temperature of Warmest Month	0.04013
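
The relative importance values in Table 4 add up to approximately 100, so they can be read as percentage shares. A small sketch of how to accumulate them, using the five leading rows copied from the table; the function name `cumulative_importance` is hypothetical.

```python
# Leading rows of Table 4 (variable code, relative importance), as printed.
top_rows = [
    ("Bio1", 51.71885),
    ("Bio11", 17.91296),
    ("Bio15", 10.67240),
    ("Bio14", 4.64482),
    ("Bio12", 2.75624),
]

def cumulative_importance(rows):
    """Return (variable, running total) pairs in the order given."""
    running = 0.0
    result = []
    for name, value in rows:
        running += value
        result.append((name, round(running, 5)))
    return result

cumulative = cumulative_importance(top_rows)
```

Read this way, the three leading variables (Bio1, Bio11, Bio15) account for about 80% of the total relative importance.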

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).