Mining Type-β Co-Location Patterns on Closeness Centrality in Spatial Data Sets

Zou, Muquan; Wang, Lizhen; Wu, Pingping; Tran, Vanha

doi:10.3390/ijgi11080418


¹ Department of Computer and Engineering, Yunnan University, Kunming 650091, China

² School of Information Engineering, Kunming University, Kunming 650214, China

³ Key Laboratory of Data Governance and Intelligent Decision in Universities of Yunnan, Kunming University, Kunming 650214, China

⁴ Department of Computer Science and Engineering, Dianchi College, Yunnan University, Kunming 650228, China

⁵ Department of Information Technology Specialization, FPT University, Hoa Lac High Tech Park, Hanoi 155514, Vietnam

* Author to whom correspondence should be addressed.

ISPRS Int. J. Geo-Inf. 2022, 11(8), 418; https://doi.org/10.3390/ijgi11080418

Submission received: 26 May 2022 / Revised: 13 July 2022 / Accepted: 20 July 2022 / Published: 23 July 2022


Abstract: A co-location pattern is a set of spatial features whose instances are frequently correlated with each other in space. Its mining models always consist of two essential steps. One step generates neighbor relationships between spatial instances, and the other checks the prevalence of candidate patterns on clique, star, or Delaunay triangulation relationships. At least three major issues are addressed in this paper. First, since different spatial regions have different distribution densities, it is difficult to set appropriate parameters to generate ideal neighbor relationships. Second, the clique relationship and the others are so rigid that the users’ personal interests are suppressed; some interesting patterns are neglected unless redundancy is increased. Third, the different strengths of correlations among instances are neglected in the prevalence calculation, which leaves the correlations among features undifferentiated. Accordingly, the main work of this paper includes: (1) The neighbor relationship generation is improved on the idea that the distances between an instance and any of its neighbors are not remarkably different. (2) The type-β co-location pattern is defined and checked based on a co-occurrence where the closeness centrality of each instance is not less than a given threshold β. (3) Since the closeness centrality carries the strength of correlations among instances, the strength of the correlations between a feature and the other ones in a type-β co-location pattern can be evaluated in the prevalence calculation. Finally, experiments on synthetic and real-world spatial data sets are used to assess the effectiveness and efficiency of our work. The results show that fewer spatial neighbor relationships are generated, and more interesting patterns can be discovered by flexibly adjusting β according to the user’s preferences.

Keywords: spatial data mining; type-β co-location pattern; spatial topological relationship; closeness centrality; strength of correlation

1. Introduction

Spatial data mining has reached its pinnacle with the advancement of data collecting and processing technology. One of the most important research interests in this domain is co-location pattern mining [1]. For co-location pattern mining in spatial data sets, a spatial feature is a label of some objects (e.g., the clover). Furthermore, an instance is a feature’s object with location information. A co-location pattern is a spatial feature set whose instances are frequently located together in a geographic space [2]. The co-location pattern {the clover, the wasp, the vole, the cat, the cow}, for example, reveals that the instances of the spatial features in this pattern make a healthy livestock ecological system [3]. Co-location pattern mining is widely employed in domains such as ecological protection, public hygiene, urban planning, ad distribution, and so on because co-location patterns may disclose the co-occurrence of spatial features [4].

Generally, there are two major steps to mine co-location patterns in the traditional models.

One step is to generate spatial neighbor relationships among instances from the given spatial data sets. Neighbor relationships (i.e., spatial proximity) follow Tobler’s First Law of Geography: everything is related to everything else, but nearby things are more related than distant things. However, Goodchild’s Second Law of Geography claims that geographic variables exhibit uncontrolled variance. That is to say, the ideal neighbors are related not only to distances (the First Law) but also to regional distribution densities (the Second Law). For example, assuming A1 and D1 are neighbors in Figure 1, so are A1 and B1. Although the distance between A3 and B3 is similar to the one between B2 and D1, there is a stronger correlation between A3 and B3 than between B2 and D1 because of their different regional distribution densities.

Another step is to check the prevalence of candidate patterns on the spatial neighbor relationships. The instances of each pattern (i.e., checklists for prevalence) should be on the co-occurrence such as the clique [5], star [6], or triangle-based relationships [7]. Generally, the higher the ratio of co-occurrences, the higher the prevalence of the pattern. For example, {A2, B2, C2, D2} obviously supports the prevalence of {A, B, C, D}, even if it is not a clique or star in Figure 1c.

1.1. Motivation

Although many scholars have been engaged in relevant research, at least three problems remain.

(1) It is difficult to build a model that generates neighbor relationships adapted to data sets with different distribution densities. To obtain neighbor relationships that suit the densities, many scholars have proposed different solutions. Figure 1 shows some representative approaches, such as distance thresholds, KNN, and the Delaunay triangulation. For example, it is intuitive that A1 and B1 are neighbors of each other in a dense region, and so are A3 and B3 in a sparse zone. On the contrary, it is intuitive that B2 and D1 are not neighbors of each other while B3 and D3 are. As a result, it is hard even for experienced users to determine an optimal distance threshold for generating neighbor relationships, because (a) a distance threshold that is too small may underestimate the prevalence of patterns in sparse regions (e.g., Figure 1a), while (b) one that is too large may overestimate the prevalence of patterns in dense zones (e.g., Figure 1b). For example, A1 and D1 should not be considered to be neighbors of each other, while D1 and E1 can be considered to be neighbors of each other in Figure 1d. Furthermore, neighbor relationships in areas of different density can overlap and cannot be split apart. This is the biggest difference between co-location pattern mining and transaction-based association analysis. For example, D1 and E1 can be neighbors of each other in a dense zone, and so can B2 and E1 in a sparse region. Thus, classical clustering is not well suited to this problem.
(2) The instance of a pattern should be a co-occurrence, so it would be ideal to integrate and extend the traditional co-occurrences such as the clique, the star, and so on [8]. For example, {A2, B2, C2, D2} is not an instance of {A, B, C, D} when instances are based on the clique in Figure 1a. However, the correlation in {A2, B2, C2, D2} is also strong. Furthermore, {A, B, C, D} selectively occurs in other regions of Figure 1. If instances of patterns are based on the clique, {A, B, C, D} can be prevalent only in Figure 1b when the prevalence threshold is 1. This cannot meet the expectation that the features in {A, B, C, D} are strongly correlated. Furthermore, the same user may be interested in patterns with different correlations in different data sets, let alone different users. To obtain the expected patterns, users always have to adjust either the parameter of neighbor relationship generation or the prevalence threshold in traditional co-location pattern mining models, which inevitably leads to redundancy. For example, to obtain {A, B, C, D} in Figure 1, the prevalence threshold should be reduced to 1/3 if the neighbor relationships are as in Figure 1a, or the distance threshold should be increased to 16 m as in Figure 1b when the prevalence threshold is 1.
(3) Since traditional models generally check the prevalence of patterns on the appearance ratios of features’ instances, they inevitably lose the instance topology of the spatial neighbor relationships. For example, it can be acknowledged that {B, C, D} is more correlated than {B, D, E} even if the instances of each corresponding feature have appeared in the two patterns with an adaptive definition of pattern instances. The correlation between a feature and the other ones in a pattern can be evaluated from the topology of the pattern’s instances. Understandably, if the instances of a feature in a pattern are always more central in the topology, the feature has a stronger correlation with the other features in the pattern than the other ones have. How can the spatial neighbor relationships be transmitted and accumulated to the interesting patterns? This problem needs to be studied urgently. For example, B and C are more likely to be in the center of the topology than A and D in {A, B, C, D} in Figure 1.

1.2. Overall Solution

Understandably, the distances between any instance and its neighbors tend to be similar rather than widely different. For example, B1, C1, and D1 can be considered to be neighbors of A1, but B2 cannot, because it is intuitively further from A1 than B1, C1, and D1 are in Figure 1. Moreover, A2, B2, C2, D2, and E1 can be neighbors of F1, but B3 cannot, because B3 is obviously further from F1 than A2, B2, C2, D2, and E1 are. Thus, a robust method that compromises between the distance threshold and KNN is proposed in this paper to generate a more applicable neighbor relationship set. For an instance denoted $i_{u}$, the distance between $i_{u}$ and its nearest neighbor except itself is denoted $dst$. Given an elastic coefficient $α$ and an instance $i_{v}$, if the distance between $i_{u}$ and $i_{v}$ is not greater than $dst * α$, $i_{v}$ can be considered to be a directed neighbor of $i_{u}$. For example, the distances from A1 to B1, C1, D1, E1, and B2 are, respectively, 3, 3.2, 4.8, 10.7, and 15.7. Therefore, B1, C1, and D1 are considered to be directed neighbors of A1 when $1.6 \leq α \leq 3.6$. Furthermore, a pair of instances are mutual neighbors of each other if and only if they are directed neighbors of each other. For example, D1 and E1 are mutual neighbors of each other but A1 and E1 are not, as A1 is a directed neighbor of E1 but not vice versa. A mutual neighbor relationship graph is shown in Figure 2b. It is obvious that a lower $dst$ indicates a denser region. On the contrary, a greater $dst$ indicates a sparser region. That is to say, the threshold $dst * α$ helps neighbor relationship generation adapt to regional distribution densities.
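As a small numerical check of the rule above, the following sketch applies the threshold $dst * α$ to the distances quoted for A1. The function name `directed_neighbors` and the dictionary layout are illustrative assumptions, not part of the paper.

```python
def directed_neighbors(distances, alpha):
    """Return the instances within alpha times the nearest-neighbor distance.

    `distances` maps every other instance to its distance from the query
    instance (a hypothetical layout chosen for this sketch).
    """
    dst = min(distances.values())   # distance to the nearest neighbor
    outer = dst * alpha             # elastic threshold dst * alpha
    return sorted(name for name, d in distances.items() if d <= outer)


# Distances from A1 as quoted in the text.
from_a1 = {"B1": 3.0, "C1": 3.2, "D1": 4.8, "E1": 10.7, "B2": 15.7}

print(directed_neighbors(from_a1, 1.0))  # only the nearest neighbor B1
print(directed_neighbors(from_a1, 2.0))  # B1, C1 and D1
```

Because the threshold scales with each instance's own nearest-neighbor distance, the same α yields a small radius in dense regions and a large one in sparse regions.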

To measure the prevalence of patterns, their instances should be detected on the neighbor relationships. Scholars tend to define patterns’ instances based on the clique. A clique is a subgraph of the spatial neighbor relationship graph in which every pair of instances are neighbors of each other. For example, {A1, B1, C1} is an instance of {A, B, C} in Figure 1b. Interestingly, the closeness centrality [9] of each instance in a clique is 1. Here, the closeness centrality of a node $i_{u}$ among n reachable nodes (including itself) is the reciprocal of the average shortest path distance between $i_{u}$ and all $n - 1$ other reachable nodes. That is to say, the minimum closeness centrality of instances in a clique is 1. However, a co-occurrence need not be a clique. It can also be another strong co-occurrence such as the star. The minimum closeness centrality of instances in a star is not less than 1/2 ($\frac{k - 1}{2k - 3} > 1/2$, where k is the instance count of the star). Thus, for any pattern p, if there exists a co-occurrence carrying p whose instances’ minimum closeness centrality is not less than a given threshold $β$ ($0 < β \leq 1$), the co-occurrence is regarded as an instance of p in this paper.
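The values quoted above for the clique and the star can be reproduced with a short sketch. It assumes that shortest paths are counted in hops inside the subgraph induced by the candidate instances, which is one reading of the informal description given here; the helper names `closeness`, `is_beta_cooccurrence`, and `make_graph` are hypothetical.

```python
from collections import deque


def closeness(adj, nodes, source):
    """Closeness centrality of `source` in the subgraph induced by `nodes`.

    Returns (n - 1) / (sum of hop distances to the other nodes), or 0.0 when
    the subset has fewer than two nodes or is not connected from `source`.
    """
    nodes = set(nodes)
    dist = {source: 0}
    queue = deque([source])
    while queue:                       # breadth-first search
        u = queue.popleft()
        for v in adj.get(u, ()):
            if v in nodes and v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    if len(nodes) < 2 or len(dist) < len(nodes):
        return 0.0
    return (len(nodes) - 1) / sum(dist.values())


def is_beta_cooccurrence(adj, nodes, beta):
    """True when every instance in `nodes` has closeness centrality >= beta."""
    return min(closeness(adj, nodes, u) for u in nodes) >= beta


def make_graph(edges):
    """Build an undirected adjacency map from a list of edges."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


nodes = ["A", "B", "C", "D"]
clique = make_graph([("A", "B"), ("A", "C"), ("A", "D"),
                     ("B", "C"), ("B", "D"), ("C", "D")])
star = make_graph([("A", "B"), ("A", "C"), ("A", "D")])   # A is the hub
path = make_graph([("A", "B"), ("B", "C"), ("C", "D")])

print(closeness(clique, nodes, "A"))   # every clique member scores 1
print(closeness(star, nodes, "B"))     # a leaf scores (k - 1) / (2k - 3)
print(is_beta_cooccurrence(path, nodes, 0.6))
```

With four instances, a leaf of the star reaches the hub in one hop and the two other leaves in two hops, so its score is 3/5; the end of the path scores 3/6. Under this reading, a threshold of 0.6 admits the clique and the star but rejects the path.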

The threshold $β$ adjusts the correlation of patterns’ instances: the greater $β$ is, the stronger the correlations are. That is to say, users can set a suitable $β$ to cater to their individual interests. Our new co-occurrence based on closeness centrality can be seen as an extension of the clique and the star. To obtain the expected interesting patterns, users can also adjust $β$ instead of rebuilding neighbor relationships or changing the prevalence threshold.

Additionally, for any pattern, if the instances of a feature always have a higher closeness centrality in the pattern’s instances, the feature tends to have a higher closeness centrality in the pattern. In other words, a feature in a pattern whose instances always have a higher closeness centrality has a stronger correlation with the other features in the pattern. In this way, the topology of the patterns’ instances is passed on to the features in the pattern.

The contributions of this paper are summarized in the following:

(1) Based on the observation that the distances between an instance and its neighbors tend to be similar, a robust method is introduced to generate neighbor relationships. This method adapts to the different distribution densities of spatial data sets. It combines the advantages of the distance threshold and the nearest neighbors.
(2) A co-occurrence based on closeness centrality is proposed to integrate the clique and the star. It extends the instances of the traditional co-location pattern. It can be flexibly scaled by setting the threshold $β$ according to the user’s interests. Some interesting patterns are no longer ignored, while neither the spatial neighbor relationships nor the prevalent patterns need to be sacrificed to redundancy.
(3) An extended co-location pattern, called the type-$β$ co-location pattern, is proposed based on the closeness centrality. Since the closeness centrality carries the topology of instances, whether a feature is in the center of the topology of a type-$β$ co-location pattern can be evaluated.
(4) Some properties are demonstrated to prune candidate patterns. Algorithms that are proven to be correct and complete are proposed. Furthermore, they are tested on both real and synthetic data sets. The experimental results reveal that the framework is more adaptable to the needs of users than some other algorithms.

The rest of this paper is laid out as follows. Section 2 begins with an overview of the standard co-location pattern model, followed by a review of previous research. To construct an applicable neighbor relationship graph, Section 3 describes a robust strategy that compromises between the distance threshold and KNN. The metrics of pattern prevalence based on closeness centrality are proposed in Section 4, and then a standard is introduced to evaluate the dominance of each feature in patterns. Additionally, two corresponding algorithms are proposed. The experimental results are presented in Section 5. In Section 6, we conclude and discuss our work.

2. Related Work

In this section, both a traditional co-location pattern mining model and related work are reviewed.

2.1. Traditional Definitions and Lemmas

Given a spatial data set D with m spatial features ($F = \{f_{1}, f_{2}, \dots, f_{m}\}$) and n spatial feature instances ($I = \{i_{1}, i_{2}, \dots, i_{n}\}$), each instance carries a feature label [10].

For any pair of instances (denoted $i_{u}$ and $i_{v}$) in a nearby area (e.g., $dis(i_{u}, i_{v}) \leq d$, where $dis(i_{u}, i_{v})$ returns the distance between $i_{u}$ and $i_{v}$, and d is a distance threshold given by a user), we say there exists a neighbor relationship between $i_{u}$ and $i_{v}$ (denoted $i_{u} R i_{v}$ or $(i_{u}, i_{v}) \in R$). Generally, $i_{u}$ is a neighbor of itself.

Definition 1 (Neighbor relationship graph). Given a spatial neighbor relationship set R on instance set I, the undirected graph $G = (I, R)$ is called a spatial neighbor relationship graph.

Example 1. $(A1, B1) \in R$ in Figure 1a.

Since different regions always have different distribution densities in spatial data sets, the spatial neighbor relationship graph sometimes is also generated on the nearest neighbors. This situation is discussed in the review subsection.

For any instance subset $I^{'}$ ($I^{'} \subseteq I$), the set composed of the features carried by $I^{'}$ is called the corresponding feature set of $I^{'}$ (denoted $CFs(I^{'})$), $CFs(I^{'}) = \{fea(i_{u}) \mid i_{u} \in I^{'}\}$, where $fea(i_{u})$ returns the feature carried by $i_{u}$.

Definition 2 (Co-location table instance $CIs(p)$). Given a nonempty feature subset p ($p \subseteq F$), let $I^{'}$ be a subset of I whose corresponding feature set is p. If $I^{'}$ is a clique whose size is $|p|$, $I^{'}$ is called a co-location instance of p (denoted $CI(p)$). The set containing all co-location instances of p is called the co-location table instance of p (denoted $CIs(p)$). Namely,

$$CIs(p) = \{I^{'} \mid I^{'} \subseteq I, |I^{'}| = |p|, CFs(I^{'}) = p, (\forall i_{u} \in I^{'}, \forall i_{v} \in I^{'})((i_{u}, i_{v}) \in R)\} \tag{1}$$

Example 2. In Figure 1b, $CIs(\{A, B, C, D\})$ = {{A1, B1, C1, D1}, {A2, B2, C2, D2}, {A3, B3, C3, D3}}.

The definition of a pattern’s instance is generally on the clique. In this paper, we extend it to be elastic.

Definition 3 (Participation index $PI(p)$). Given a nonempty feature subset p ($p \subseteq F$) and a feature $f_{i}$ in p, $f_{i}$’s participation ratio in p is the ratio of the distinct instances of $f_{i}$ in $CIs(p)$ to all instances carrying $f_{i}$. Namely,

$$PR(p, f_{i}) = \frac{|\{i_{u} \mid i_{u} \in \cup CIs(p), fea(i_{u}) = f_{i}\}|}{|\{i_{u} \mid i_{u} \in I, fea(i_{u}) = f_{i}\}|} \tag{2}$$

Furthermore, the participation index of p is the minimum participation ratio of the features in p. Namely,

$$PI(p) = \min_{f_{i} \in p} PR(p, f_{i}) \tag{3}$$

where the function $\min(\cdot)$ returns the minimum value.

The participation index can be variable while the participation ratio is relatively fixed. In this paper, we update the two with the closeness centrality.

Definition 4 (Co-location pattern). Given a prevalence threshold $min\_pi$ ($0 < min\_pi \leq 1$), let p ($\emptyset \subset p \subseteq F$) be a pattern. If $PI(p) \geq min\_pi$, p is called a prevalent co-location pattern (co-location pattern for short).

Example 3. In Figure 1b, $PI(\{A, B, C, D\}) = \min(PR(\{A, B, C, D\}, A), PR(\{A, B, C, D\}, B), PR(\{A, B, C, D\}, C), PR(\{A, B, C, D\}, D)) = \min(3/3, 3/3, 3/3, 3/3) = 1$. Assuming $min\_pi = 0.5$, {A, B, C, D} is a co-location pattern.
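Definitions 2 and 3 can be traced on a toy data set with the brute-force sketch below. The instance names and neighbor pairs are invented for illustration and do not reproduce Figure 1; the function names are likewise hypothetical, and the exhaustive enumeration is only practical for very small inputs.

```python
from itertools import product


def table_instance(pattern, instances, neighbors):
    """Enumerate clique-based co-location instances of `pattern`.

    `instances` maps a feature to its instance names; `neighbors` is a set of
    frozenset pairs that are neighbors of each other.
    """
    rows = []
    for combo in product(*(instances[f] for f in pattern)):
        pairs = [(a, b) for i, a in enumerate(combo) for b in combo[i + 1:]]
        if all(frozenset(p) in neighbors for p in pairs):
            rows.append(combo)
    return rows


def participation_index(pattern, instances, neighbors):
    """Return (PI, {feature: PR}) for `pattern`."""
    rows = table_instance(pattern, instances, neighbors)
    ratios = {}
    for pos, f in enumerate(pattern):
        distinct = {row[pos] for row in rows}      # distinct instances of f
        ratios[f] = len(distinct) / len(instances[f])
    return min(ratios.values()), ratios


# Hypothetical toy data, not taken from the paper's figures.
instances = {"A": ["A1", "A2", "A3"], "B": ["B1", "B2"], "C": ["C1", "C2"]}
neighbors = {frozenset(p) for p in [
    ("A1", "B1"), ("A1", "C1"), ("B1", "C1"),
    ("A2", "B2"), ("A2", "C2"), ("A3", "B2"),
]}

pi_ab, pr_ab = participation_index(("A", "B"), instances, neighbors)
pi_abc, pr_abc = participation_index(("A", "B", "C"), instances, neighbors)
print(pi_ab, pr_ab)
print(pi_abc, pr_abc)
```

In this toy data the only clique carrying {A, B, C} is {A1, B1, C1}, so the participation index of the larger pattern is limited by feature A, of which only one instance out of three participates.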

Any nonempty subset of the feature set $F = \{f_{1}, f_{2}, \dots, f_{m}\}$ is a pattern (e.g., $\{f_{1}, f_{3}, f_{4}\}$). Users are not interested in all subsets but only in the prevalent ones. Therefore, co-location patterns can be checked from the subsets of F in turn. However, this is time-consuming because the number of candidates is exponential. A workable lemma is generally used to prune candidates [11].

Lemma 1 (Antimonotonicity of $PI(p)$). Let p and $p^{'}$ be two patterns ($p^{'} \subseteq p \subseteq F$). The participation ratio of any feature in $p^{'}$ is greater than or equal to the one in p. Namely,

$$(\forall p \subseteq F)(\forall p^{'} \subseteq p)(\forall f_{i} \in p^{'})(PR(p^{'}, f_{i}) \geq PR(p, f_{i})) \tag{4}$$

Furthermore, the participation index of $p^{'}$ is greater than or equal to the one of p. Namely,

$$(\forall p \subseteq F)(\forall p^{'} \subseteq p)(PI(p^{'}) \geq PI(p)) \tag{5}$$

The proof of Lemma 1 can be seen in [10].

Example 4. In Figure 1b, $PR(\{C, D, E\}, D) = 3/3 \geq 2/3 = PR(\{A, B, C, D, E\}, D)$. Furthermore, $PI(\{C, D, E\}) = 3/3 \geq 2/3 = PI(\{A, B, C, D, E\})$.

Lemma 1 declares that a size-k pattern can be prevalent only if all of its size-($k - 1$) subsets are prevalent, where $2 \leq k \leq |F|$. Some classical algorithms such as Join-based and CPI-Tree are driven by this lemma. In this paper, this lemma is not workable. Another pruning strategy based on an upper bound is given in a later section.
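For the classical clique-based model, the pruning implied by Lemma 1 can be sketched as an Apriori-style candidate generator. This is a generic illustration, not the Join-based or CPI-Tree algorithm, and the function name `generate_candidates` is an assumption made for this example.

```python
from itertools import combinations


def generate_candidates(prevalent_prev, k):
    """Size-k candidates whose size-(k-1) subsets are all prevalent.

    `prevalent_prev` is an iterable of prevalent size-(k-1) patterns.
    """
    prevalent_prev = {frozenset(p) for p in prevalent_prev}
    features = sorted(set().union(*prevalent_prev)) if prevalent_prev else []
    candidates = []
    for combo in combinations(features, k):
        # Lemma 1: one non-prevalent subset rules the candidate out.
        if all(frozenset(sub) in prevalent_prev
               for sub in combinations(combo, k - 1)):
            candidates.append(combo)
    return candidates


# Hypothetical prevalent size-2 patterns.
size2 = [("A", "B"), ("A", "C"), ("B", "C"), ("B", "D")]
print(generate_candidates(size2, 3))
```

Here {A, B, D} is discarded without any instance search because {A, D} is not prevalent; only {A, B, C} survives as a size-3 candidate.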

Since the classical definitions and lemma were introduced, the related work is reviewed in the next subsection.

2.2. Review

The classical mining model has focused on neighbor relationship generation and prevalence tests since [12] kicked off the research of spatial co-location pattern mining.

The methods to generate spatial neighbor relationship graphs can be classified.

The first way is based on the distance threshold given by users. The most popular method uses a global static distance threshold. That is to say, if the distance between a pair is not greater than the given threshold, the pair is considered to be neighbors of each other. It is workable in evenly distributed data sets. Since unevenly distributed data sets exist, distances are amplified by the Kernel function in some papers [13]. This method, whose algorithm is called $SGCT_K$, is a classical method to differentiate distances in regions with different distribution densities. It can be a benchmark algorithm in comparison with our methods. Furthermore, local dynamic methods are introduced to fit different distributions in local regions [14]. The underlying assumption is that instances are distributed evenly in each local region.

The second way is based on the k-nearest neighbors. Ref. [15] proposed a hierarchical co-location pattern mining framework by considering both varieties of neighborhood distances and spatial heterogeneity. By adopting a k-nearest neighbor graph (KNNG) instead of a distance threshold, it proposes a distance variation coefficient as a new metric to drive the mining process and determine an individual neighbor relationship graph for each region. This method led co-location pattern mining to regional co-location pattern mining. Thus, it can be a representative of neighbor relationship generation with the nearest neighbors to adapt to different distribution densities. Its corresponding algorithm, named $RCMA$, is a benchmark scheme in comparison with our novel method. It is acknowledged that the Delaunay triangulation and the Voronoi diagram are natural tools for nearest neighbor detection. Thus, some scholars [7,16,17,18] are interested in both. They fit Tobler’s First Law of Geography well, and it is not necessary to set parameters. However, they may also make a farther instance a neighbor while a nearer one is not. For example, A3 is a neighbor of E1, but A1 is not in Figure 1d.

The third way is based on clustering. However, it is difficult to determine parameters (e.g., k for k-means) [19]. Furthermore, an intercluster instance pair may be far nearer than an intracluster one [20]. For example, the distance between A1 and E1 is farther than the one between E1 and B2, but the former pair tends to be in the same cluster while the latter pair does not. It violates Tobler’s First Law of geography.

The last way tries to balance distances and prevalence. Refs. [21,22] concluded the preferences with dynamic neighborhood constraints. Based on this, they defined the mining task as an optimization problem and proposed a greedy algorithm for mining co-location patterns with dynamic neighborhood constraints. However, it is too sensitive to the buffer size given by users. In addition, it may also violate Tobler’s First Law when there exists a rare feature corresponding to an edge.

Since it is not easy to define the neighbor relationship set in various distribution data sets, scholars tend to divide global space into regions (or zones) where instances tend to distribute evenly [23,24]. Thus, co-location pattern mining turns into regional (or zonal) co-location pattern mining. However, these methods split the spatial global domains. We focus on co-location pattern mining in the global space in this paper.

The review of the research on the neighbor relationship generation shows the biggest problem is how to adapt different distribution densities of spatial data sets. No matter the distance threshold way or nearest neighbor method, either the ideal threshold does not exist or it is difficult to lock, or it is vulnerable to noise interference. Our strategy is a compromise of the distance threshold way and the nearest neighbor method.

Once a neighbor relationship graph is generated, there are three main steps to check the prevalence of patterns. Firstly, candidate co-location patterns are generated. Since co-location pattern mining comes from association analysis, the downward-closure or partial downward-closure property [25] is expected to prune candidates. Secondly, instances of every candidate pattern are generally collected on cliques. The above two steps are generally interleaved. More and more scholars realize that clique-based instances are too strict because every pair of instances should be neighbors of each other in a clique. Thus, the star neighbor relationship [6], the triangle-based relationship [16], and so on are introduced. However, there is a lack of effective integration (or extension) of these methods. Thirdly, prevalence is always measured on the minimum participation ratio of features in each candidate pattern (or the maximal participation ratio of features in each candidate pattern with rare features) [26]. Furthermore, ref. [27] proposes a new measure called fraction-score, whose idea is to count instances fractionally if they overlap. However, the relationship between the prevalence metric and the neighbor relationship set is simply split by the above methods.

Through reviewing the research on prevalent pattern validation, we find that the biggest problem is the contradiction between the users’ personalized interests and the redundancy of the mining results. To catch expected patterns, users have to either change the spatial neighbor relationships or resize the prevalence threshold. The change in spatial neighbor relationships must inevitably aggravate the difficulty of generating spatial neighbor relationships due to different distribution densities. The resizing leads to redundancy of output patterns.

Ref. [28] notes that users are not only interested in identifying the prevalence of a feature set but also in the dominant features. They focus on mining dominant features in every co-location pattern on the changes of participation ratios of each feature in size-wised patterns. However, scholars have not distinguished between strong and weak neighbor relationships for prevalence pattern validation, let alone the effect of topology among features in patterns.

To sum up, (a) the traditional neighbor relationship generation pays more and more attention to unevenly distributed data sets, but it still has a long way to go. (b) Scholars try to improve efficiency of pattern prevalence check or focus on optimizing participation index, but users’ interests are compromised by redundancy of output patterns. (c) The research on feature topology in prevalent patterns has started, but it is stagnant.

3. Spatial Mutual Neighbor Relationship Graph

In this section, the directed neighbors of each instance are segmented from the data set according to both the instance’s nearest neighbor except itself and a given threshold $α$, and then mutual neighbor relationships are checked from them.

3.1. Segmentation

For an instance, the distances between it and its neighbors tend to be similar but not remarkably different. This is our assumption in this paper. In other words, the distance between an instance and its nearest neighbor except itself determines its possible neighbors.

Definition 5 (Inside radius). Given a spatial data set D with an instance set $I = \{i_{1}, i_{2}, \dots, i_{n}\}$ and a feature set $F = \{f_{1}, f_{2}, \dots, f_{m}\}$, let $i_{u}$ be an instance ($i_{u} \in I$). The inside radius of $i_{u}$ (denoted $IR(i_{u})$) is the distance between $i_{u}$ and its nearest neighbor except itself. Namely,

$$IR(i_{u}) = \min_{i_{v} \in I \setminus \{i_{u}\}} dis(i_{u}, i_{v}) \tag{6}$$

where $dis(i_{u}, i_{v})$ returns the distance (e.g., Euclidean distance) between instances $i_{u}$ and $i_{v}$.

Intuitively, $IR(i_{u})$ estimates the distance between $i_{u}$ and its possible neighbors; namely, no possible neighbor of $i_{u}$ may be much farther than $IR(i_{u})$ from $i_{u}$.

Definition 6 (Outer radius). Given an elastic coefficient $α$ ($α \geq 1$) by the user, let $IR(i_{u})$ be the inside radius of instance $i_{u}$. The outer radius of $i_{u}$ (denoted $OR_{α}(i_{u})$) is $IR(i_{u}) * α$. Namely,

$$OR_{α}(i_{u}) = IR(i_{u}) * α \tag{7}$$

Definition 7 (Directed neighbors). The directed neighbors of the instance $i_{u}$ (denoted $DNs_{α}(i_{u})$) are the instances whose distances to $i_{u}$ are not greater than the outer radius of $i_{u}$. Namely,

$$DNs_{α}(i_{u}) = \{i_{v} \mid i_{v} \in I, dis(i_{u}, i_{v}) \leq OR_{α}(i_{u})\} \tag{8}$$

Definition 7 segments instances on the outer radius to filter directed neighbors for each instance. Once the directed neighbors of each instance are detected, mutual neighbors can be checked.

Definition 8 (Mutual neighbors). A pair of instances $i_{u}$ and $i_{v}$ are mutual neighbors of each other if and only if $i_{v}$ is a directed neighbor of $i_{u}$ and $i_{u}$ is a directed neighbor of $i_{v}$. Namely,

$$i_{v} \in MNs_{α}(i_{u}) \land i_{u} \in MNs_{α}(i_{v}) \Leftrightarrow i_{v} \in DNs_{α}(i_{u}) \land i_{u} \in DNs_{α}(i_{v}) \tag{9}$$

where $MNs_{α}(i_{u})$ is the set composed of the mutual neighbors of $i_{u}$.

The user can adjust the neighbor scale by adjusting $α$. The biggest advantage of this method is that the user only needs to adjust one intuitive parameter according to their own preferences and does not need to pay attention to the differences in distribution density among regions.

Definition 9 (Mutual neighbor relationship graph). Given an instance set $I = \{i_{1}, i_{2}, \dots, i_{n}\}$ in a data set D, the mutual neighbor relationship graph G (obviously an undirected graph) is defined as follows:

$$G = (I, \{(i_{u}, i_{v}) \mid i_{u} \in I, i_{v} \in MNs_{α}(i_{u})\}) \tag{10}$$

where the instance set I is the node set of G and $\{(i_{u}, i_{v}) \mid i_{u} \in I, i_{v} \in MNs_{α}(i_{u})\}$ is the edge set of G.

3.2. Problem Statement

Based on the definitions above, we give a formal description to generate a mutual neighbor relationship graph in the following.

Given: (1) a spatial data set D with a feature set F and an instance set I; (2) an elastic coefficient $α$ ($α \geq 1$) for segmentation.

Find: The spatial mutual neighbor relationship graph $G = (I, \{(i_{u}, i_{v}) \mid i_{u} \in I, i_{v} \in MNs_{α}(i_{u})\})$.

Constraints: Each edge e in G ($e \in \{(i_{u}, i_{v}) \mid i_{u} \in I, i_{v} \in MNs_{α}(i_{u})\}$) should not be longer than $α$ times the inside radius of either of its endpoints.

3.3. Generating Mutual Neighbor Relationship Graph on KD-Tree

Since the mutual neighbor relationship graph is strongly correlated with the nearest neighbors, a k-dimensional tree (KD-Tree) is used to store the instance information of the given data set. The procedure is shown in Algorithm 1. The location information of all instances is used to generate a KD-Tree [29] in step 2. For each instance, its nearest neighbor except itself and its inside radius are found on the KD-Tree, and then its directed neighbors are found on the same tree in steps 3 to 5. For each instance, its mutual neighbors are searched in steps 6 to 12.

Algorithm 1 Generating mutual neighbor relationship graph on KD-Tree.

Require: D, F, I, $α$.
Ensure: $G = (I, E)$
1: $E = \emptyset$
2: tree = KD-Tree(D)
3: for $i_{u} \in I$ do
4:     $DNs_{α}(i_{u}) = tree.query(i_{u}.xy, α)$ //Definition 7.
5: end for
6: for $i_{u} \in I$ do
7:     for $i_{v} \in DNs_{α}(i_{u})$ do
8:         if $i_{u} \in DNs_{α}(i_{v})$ then
9:             $E.add((i_{u}, i_{v}))$ //Definition 8.
10:         end if
11:     end for
12: end for
13: return G = (I, E) //Definition 9.

Algorithm 1 is highly efficient. Step 2 costs

O (l o g^{2} ∣ I ∣)

. Assuming every instance has k (

1 \leq k \leq m

) directed neighbors on average, step 4 costs

O (∣ I ∣^{1 - 1 / k} + m)

where m is the number of the nearest instances to be searched each time. Additionally,

m = \frac{2 ∣ E ∣}{∣ I ∣}

on average. Thus, steps 3 to 5 cost

O ((∣ I ∣^{1 - 1 / k} + m) \times ∣ I ∣)

(i.e.,

O (∣ I ∣^{2})

approximately) because

∣ I ∣ > > m

in KD-Tree. Steps from step 6 to step 12 cost

O (k ∣ I ∣ / 2)

since steps from step 7 to step 11 cost

O (k / 2)

. Therefore, this algorithm costs

O (∣ I ∣^{2})

dominated by steps 3 to 5.

Step 4 guarantees the directed neighbors of each instance are correct. Steps from step 6 to step 12 guarantee mutual neighbors are symmetrical.

Thus, Algorithm 1 is correct and efficient to generate mutual neighbors of instances.
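The construction in Algorithm 1 can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: the inside radius of an instance is taken to be the distance to its nearest other instance, and a directed neighbor is taken to be any instance within α times that radius (both read off the Constraints in Section 3.2, since Definitions 6 to 8 are given earlier). A brute-force search replaces the KD-Tree for brevity, and the function name and toy coordinates are hypothetical.

```python
from math import dist

def mutual_neighbor_graph(points, alpha):
    """points: dict name -> (x, y) with at least two entries.
    Returns the edge set E as a set of frozenset pairs."""
    names = list(points)
    # Assumed inside radius: distance to the nearest other instance.
    radius = {
        u: min(dist(points[u], points[v]) for v in names if v != u)
        for u in names
    }
    # Assumed directed neighbors: instances within alpha * inside radius.
    directed = {
        u: {v for v in names
            if v != u and dist(points[u], points[v]) <= alpha * radius[u]}
        for u in names
    }
    # Steps 6 to 12: keep (u, v) only if each is a directed neighbor of the other.
    return {frozenset((u, v))
            for u in names for v in directed[u] if u in directed[v]}

# Hypothetical toy data: three nearby instances and one distant instance.
points = {"A1": (0.0, 0.0), "B1": (1.0, 0.0), "C1": (2.5, 0.0), "D1": (10.0, 0.0)}
edges = mutual_neighbor_graph(points, alpha=2.0)
```

In this toy data D1 treats every instance as a directed neighbor because its own inside radius is large, but no other instance reciprocates, so D1 stays isolated. That is the density-adaptive behavior the mutual check is meant to provide.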

4. Prevalence Check on Closeness Centrality

In this section, we define instances of patterns on the mutual neighbor relationship graph with closeness centrality. Furthermore, the prevalence of patterns is checked on their instances. Accordingly, an efficient algorithm based on a size-wise search is proposed to mine prevalent patterns.

4.1. Definitions and Theorems

Definition 10

(Closeness centrality). Given a spatial mutual neighbor relationship graph

G = (I, {(i_{u}, i_{v}) ∣ i_{u} \in I, i_{v} \in M N s_{α} (i_{u})})

, let

I^{'}

be a nonempty subset of I and

i_{u}

be an instance in

I^{'}

(

i_{u} \in I^{'}

). The closeness centrality of the instance

i_{u}

in

I^{'}

is the reciprocal of its average farness (i.e., its average shortest path length) to all other nodes in

I^{'}

. Namely,

C C (I^{'}, i_{u}) = \frac{∣ I^{'} ∣ - 1}{\sum_{i_{v} \in I^{'}} ∣ s p (i_{u}, i_{v}) ∣}

(11)

where

∣ s p (i_{u}, i_{v}) ∣

returns the shortest path length [30] between

i_{u}

and

i_{v}

in G but not in the induced subgraph of

I^{'}

in G.

Example 5.

In Figure 2b,

C C ({A 1, E 1}, A 1) = \frac{2 - 1}{2} = 1 / 2

.

Definition 10 evaluates the difficulty of reachability to instances in

I^{'}

from

i_{u}

. The higher the

C C (I^{'}, i_{u})

, the easier the reachability.

Particularly, if there exists an instance pair

i_{u}

and

i_{v}

in

I^{'}

between which no path exists in G

, let

C C (I^{'}, i_{u}) = 0

because

∣ s p (i_{u}, i_{v}) ∣ = \infty

. Moreover, let

C C ({i_{u}}, i_{u}) = 1

, as the instance

i_{u}

can be directly reachable to itself.

Definition 11

(Minimum closeness centrality). Given a spatial mutual neighbor relationship graph

G = (I, {(i_{u}, i_{v}) ∣ i_{u} \in I, i_{v} \in M N s_{α} (i_{u})})

, let

I^{'}

be a subset of I. The minimum closeness centrality of

I^{'}

is the minimum closeness centrality of instances in

I^{'}

. Namely,

M C C (I^{'}) = m i n_{i_{u} \in I^{'}} C C (I^{'}, i_{u})

(12)

Example 6.

In Figure 2b,

M C C ({A 1, C 2, E 1}) = m i n (\frac{3 - 1}{2 + 4}, \frac{3 - 1}{2 + 2}, \frac{3 - 1}{2 + 4}) = 1 / 3

.

Definition 11 reveals the correlation of instances in

I^{'}

. Greater

M C C (I^{'})

means a stronger correlation.

Particularly, let

M C C (\emptyset) = 1

and

M C C ({i_{u}}) = 1

where

i_{u} \in I

. If some instance pair in

I^{'}

(

I^{'} \subseteq I

) cannot reach each other in the mutual neighbor relationship graph, then

M C C (I^{'}) = 0

because

(\exists i_{u} \in I^{'}) (C C (I^{'}, i_{u}) = 0)

. If every instance pair in

I^{'}

can reach each other,

(\forall i_{u} \in I^{'}) (0 \leq C C (I^{'}, i_{u}) \leq 1)

because

(\forall i_{u} \in I^{'}) (\forall i_{v} \in I^{'}) (i_{u} \neq i_{v} \Rightarrow 1 \leq ∣ s p (i_{u}, i_{v}) ∣ \leq ∣ I^{'} ∣ - 1)

. Therefore,

(\forall I^{'} \subset I) (0 \leq M C C (I^{'}) \leq 1)

.
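Definitions 10 and 11 can be illustrated with a short Python sketch. It is a minimal sketch rather than the authors' implementation: the graph is stored as an adjacency dictionary, shortest path lengths are hop counts obtained by breadth-first search over the whole graph G (not the induced subgraph), and exact fractions are used so the values can be compared with the examples. The helper names and the toy path graph are hypothetical; the graph is not Figure 2b, it is merely chosen so that its hop counts match those used in Examples 5 and 6.

```python
from collections import deque
from fractions import Fraction

def build_adj(edges):
    """Undirected adjacency dictionary from a list of (u, v) pairs."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def hop_distances(adj, source):
    """Breadth-first search: hop count from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, ()):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def closeness(adj, subset, u):
    """CC(I', i_u) of Equation (11)."""
    if len(subset) == 1:
        return Fraction(1)              # convention: CC({i_u}, i_u) = 1
    dist = hop_distances(adj, u)
    if any(v not in dist for v in subset):
        return Fraction(0)              # an unreachable instance gives CC = 0
    return Fraction(len(subset) - 1, sum(dist[v] for v in subset))

def min_closeness(adj, subset):
    """MCC(I') of Equation (12), with MCC of the empty set equal to 1."""
    if not subset:
        return Fraction(1)
    return min(closeness(adj, subset, u) for u in subset)

# Hypothetical path graph A1 - B1 - E1 - D2 - C2.
adj = build_adj([("A1", "B1"), ("B1", "E1"), ("E1", "D2"), ("D2", "C2")])
cc_a1 = closeness(adj, {"A1", "E1"}, "A1")
mcc = min_closeness(adj, {"A1", "C2", "E1"})
```

On this toy graph `cc_a1` equals 1/2 and `mcc` equals 1/3, the same values as in Examples 5 and 6.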

Lemma 2

(Partial antimonotonicity of minimum closeness centrality). Given a spatial mutual neighbor relationship graph

G = (I, {(i_{u}, i_{v}) ∣ i_{u} \in I, i_{v} \in M N s_{α} (i_{u})})

, let

I^{'}

be a size-k subset of I (

I^{'} \subseteq I \land k \geq 1

). There must exist a size-(

k - 1

) subset of

I^{'}

(denoted

I^{″}

) to make

M C C (I^{″}) \geq M C C (I^{'})

. Namely,

(\forall I^{'} \subseteq I) (\exists I^{″} \subset I^{'}) (∣ I^{'} ∣ \geq 1 \land ∣ I^{'} ∣ = ∣ I^{″} ∣ + 1 \to M C C (I^{″}) \geq (M C C (I^{'})))

(13)

Proof of Lemma 2.

If

M C C (I^{'}) = 0

, it is obviously true.

As shown in Figure 3, assuming

G^{'} = (I^{'}, M R s^{'})

is a subgraph in the mutual neighbor relationship graph

G = (I, {(i_{u}, i_{v}) ∣ i_{u} \in I, i_{v} \in M N s_{α} (i_{u})})

where

I^{'} \subseteq I

and

M R s^{'} = {(i_{u}, i_{v}) ∣ i_{u} \in I^{'} \land i_{v} \in I^{'} \land ∣ s p (i_{u}, i_{v}) ∣ < \infty}

, let

i_{v}

be the farthest instance from

i_{u}

in

G^{'} = (I^{'}, M R s^{'})

, namely,

(\forall i_{w} \in I^{'}) (∣ s p (i_{u}, i_{v}) ∣ \geq ∣ s p (i_{u}, i_{w}) ∣)

. Moreover, let

i_{u}

be the instance making

M C C (I^{″}) = C C (I^{″}, i_{u})

where

I^{″} = I^{'} \ i_{v}

and

M R s^{″} = M R s^{'} - {(i_{v}, i_{w}) ∣ i_{w} \in I^{'}}

. Assuming

∣ s p (i_{u}, i_{v}) ∣ = κ

, then,

M C C (I^{″}) = C C (I^{″}, i_{u}) = \frac{η - 1}{σ}

where

η = ∣ I^{″} ∣

and

σ = \sum_{i_{w} \in I^{″}} ∣ s p (i_{u}, i_{w}) ∣

. Therefore,

M C C (I^{'}) \leq C C (I^{'}, i_{u}) = \frac{η - 1 + 1}{σ + κ} = \frac{η}{σ + κ}

. Thus,

\begin{matrix} M C C (I^{″}) - M C C (I^{'}) \\ \geq & \frac{η - 1}{σ} - \frac{η}{σ + κ} \\ = & \frac{η σ - σ + η κ - κ - η σ}{σ (σ + κ)} \\ = & \frac{η κ - σ - κ}{σ (σ + κ)} \end{matrix}

∵ ∣ s p (i_{u}, i_{v}) ∣ = κ

∴ (\forall δ \in N^{*}) (\exists i_{s} \in I^{'}) (1 \leq δ \leq κ, ∣ s p (i_{u}, i_{s}) ∣ = δ)

and

κ \leq η - 1

.

∴ η \geq κ + 1 \land σ \leq 1 + 2 + \dots + (κ - 1) + κ (η - 1 - (κ - 1))

. Thus,

\begin{matrix} η κ - σ - κ \\ \geq & η κ - (1 + 2 + \dots + (κ - 1) + κ (η - 1 - (κ - 1))) - κ \\ = & \frac{κ^{2} - κ}{2} \\ ∵ & κ \geq 1 \\ ∴ & κ^{2} - κ \geq 0 \\ ∴ & \frac{κ^{2} - κ}{2} \geq 0 \\ ∴ & η κ - σ - κ \geq 0 \\ ∵ & σ (σ + κ) \geq 0 \\ ∴ & \frac{η κ - σ - κ}{σ (σ + κ)} \geq 0 \\ ∴ & M C C (I^{″}) - M C C (I^{'}) \geq 0 \\ ∴ & (\forall I^{'} \subseteq I) (\exists I^{″} \subset I^{'}) (∣ I^{'} ∣ \geq 1 \land ∣ I^{'} ∣ = ∣ I^{″} ∣ + 1 \to M C C (I^{″}) \geq M C C (I^{'})) \end{matrix}

□

Lemma 2 implies that if no size-(k − 1) subset of a size-k instance set has a minimum closeness centrality greater than or equal to a given threshold

β

, the minimum closeness centrality of that instance set cannot be greater than or equal to

β

either.

Example 7.

In Figure 2b,

M C C ({A 1, E 1}) = 1 / 2 < 2 / 3

,

M C C ({A 1, C 2}) = 1 / 4 < 2 / 3

, and

M C C ({C 2, E 1}) = 1 / 2 < 2 / 3

; thus,

M C C ({A 1, C 2, E 1}) < 2 / 3

.

Definition 12

(Type-

β

co-location instance). Given a pattern p (

p \subseteq F

) and

G = (I, {(i_{u}, i_{v}) ∣ i_{u} \in I, i_{v} \in M N s_{α} (i_{u})})

, let

M C C (I^{'})

be the minimum closeness centrality of

I^{'}

where

I^{'} \subseteq I

. If the corresponding feature set of

I^{'}

is p and

M C C (I^{'}) \geq β

, where β is the given threshold by users,

I^{'}

is called a type-β co-location instance of p. The set composed of all type-β co-location instances of p is called the type-β co-location table instance of p. Namely,

C I s_{β} (p) = {I^{'} ∣ I^{'} \subseteq I, ∣ I^{'} ∣ = ∣ p ∣, C F s (I^{'}) = p, M C C (I^{'}) \geq β}

(14)

Example 8.

In Figure 2b,

C I s_{β} ({A, B, C, D, E}) = {{A 1, B 1, C 1, D 1, E 1}, {A 2, B 2, C 2, D 2, E 1}, {A 3, B 3, C 3, D 3, E 2}}

while

β \leq 4 / 7

.

Definition 12 is more flexible than Definition 2, under which the minimum closeness centrality of a co-location instance

I^{'}

is 1 (i.e.,

\frac{∣ I^{'} ∣ - 1}{(∣ I^{'} ∣ - 1) * 1}

).

The minimum closeness centrality of a type-

β

co-location instance

I^{'}

bounds the maximal distance between its instance pairs. Thus, the correlation of

I^{'}

can be evaluated. In other words, if

I^{'}

is a type-

β

co-location instance, each pair of instances in

I^{'}

can reach each other within a bounded number of steps.

Theorem 1

(Necessary condition of type-

β

co-location instance). Given an instance set

I^{'}

in

G = (I, {(i_{u}, i_{v}) ∣ i_{u} \in I, i_{v} \in M N s_{α} (i_{u})})

, if

I^{'}

is a type-β co-location instance, any pair of instances in

I^{'}

can reach each other within

⌈ \frac{2}{β} ⌉ - 1

steps. Namely,

\begin{matrix} M C C (I^{'}) \geq β \Rightarrow (\forall i_{u} \in I^{'}) (\forall i_{v} \in I^{'}) (0 \leq ∣ s p (i_{u}, i_{v}) ∣ \leq ⌈ \frac{2}{β} ⌉ - 1) \end{matrix}

(15)

Proof of Theorem 1.

Given an instance set

I^{'}

in

G = (I, {(i_{u}, i_{v}) ∣ i_{u} \in I, i_{v} \in M N s_{α} (i_{u})})

, if

I^{'}

is a type-

β

co-location instance,

1 \geq M C C (I^{'}) \geq β

. Let

η = ∣ I^{'} ∣

. Let

i_{0}, i_{1}, \dots, i_{κ}

be the longest shortest path between any pair of instances in

I^{'}

. Thus,

i_{0} \in I^{'} \land i_{κ} \in I^{'}

. Furthermore,

C C (I^{'}, i_{0}) \geq β \land C C (I^{'}, i_{κ}) \geq β

. Moreover, for any instance

i_{u}

in

I^{'}

(

i_{u} \in I^{'}

), there must be

∣ s p (i_{0}, i_{u}) ∣ + ∣ s p (i_{u}, i_{κ}) ∣ \geq ∣ s p (i_{0}, i_{κ}) ∣ = κ

. Therefore,

κ

attains its maximum when every instance

i_{u}

in

I^{'}

satisfies

∣ s p (i_{0}, i_{u}) ∣ + ∣ s p (i_{u}, i_{κ}) ∣ = κ

and

M C C (I^{'}) = β

. That is to say,

M C C (I^{'}) = \frac{η - 1}{⌈ \frac{η κ}{2} ⌉} = β

makes

κ

be maximal. Thus,

κ \leq ⌊ \frac{2}{β} - \frac{2}{η β} ⌋

. Because

κ \in N^{*}

and

0 \leq β \leq 1

,

η β

can be much larger than

β

. Thus,

\frac{2}{β} > > \frac{2}{η β}

. Therefore,

(\forall i_{u} \in I^{'}) (\forall i_{v} \in I^{'}) (1 \leq ∣ s p (i_{u}, i_{v}) ∣ \leq ⌈ \frac{2}{β} ⌉ - 1)

. □

Theorem 1 declares that there exists a strong correlation among type-

β

co-location instances. For example, if

β = 1

, any pair of instances can be reachable to each other in 1 (i.e.,

⌈ \frac{2}{1} ⌉ - 1

) step. It is a clique at least. Similarly, if

β \geq 4 / 5

, any pair of instances can be reachable to each other in 2 (i.e.,

⌈ \frac{2}{4 / 5} ⌉ - 1

) steps. It is a star at least. If

β \geq 1 / 2

, any pair of instances can be reachable to each other in 3 (i.e.,

⌈ \frac{2}{1 / 2} ⌉ - 1

) steps. If

β \geq 2 / 5

, any pair of instances can be reachable to each other in 4 (i.e.,

⌈ \frac{2}{2 / 5} ⌉ - 1

) steps. If

β \geq 1 / 3

, any pair of instances can be reachable to each other in 5 (i.e.,

⌈ \frac{2}{1 / 3} ⌉ - 1

) steps. The rest can be carried out in the same manner.
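The step bounds listed above follow mechanically from the expression in Theorem 1. The following sketch, with a hypothetical function name, evaluates the bound using exact rational arithmetic so that values such as 4/5 or 1/3 are not disturbed by floating-point rounding.

```python
from fractions import Fraction
from math import ceil

def max_steps(beta):
    """Upper bound ceil(2 / beta) - 1 on the hop distance (Theorem 1)."""
    beta = Fraction(beta)
    if not 0 < beta <= 1:
        raise ValueError("beta must lie in (0, 1]")
    return ceil(2 / beta) - 1

# Thresholds taken from the paragraph above.
bounds = {b: max_steps(b) for b in
          (Fraction(1), Fraction(4, 5), Fraction(1, 2), Fraction(2, 5), Fraction(1, 3))}
```

The resulting bounds are 1, 2, 3, 4, and 5 steps, respectively, matching the cases enumerated in the text.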

Unfortunately, if every instance pair in an instance subset

I^{'}

can be reachable to each other in

⌈ \frac{2}{β} ⌉ - 1

steps,

M C C (I^{'}) \geq β

is not necessarily true. That is to say, it is a necessary condition but not necessarily sufficient.

Example 9.

M C C ({A 1, B 1, C 1, D 1, E 1}) = 4 / 7 \Rightarrow (\forall i_{u} \in {A 1, B 1, C 1, D 1, E 1}) (\forall i_{v} \in {A 1, B 1, C 1, D 1, E 1}) (∣ s p (i_{u}, i_{v}) ∣ \leq ⌈ \frac{2}{4 / 7} ⌉ - 1 = 3)

. Conversely,

⌈ \frac{2}{β} ⌉ - 1 = 2

but

M C C ({A 1, E 1}) = \frac{2 - 1}{2} = \frac{1}{2} < β

when β = 2/3 in Figure 2b.

According to Lemma 2 and Theorem 1, candidate type-

β

co-location instances can be pruned effectively. That is to say, if some pair of instances in a candidate type-

β

co-location instance

I^{'}

cannot reach each other within

⌈ \frac{2}{β} ⌉ - 1

steps,

I^{'}

cannot be a type-

β

co-location instance. Furthermore, if no size-(

∣ I^{'} ∣ - 1

) subset of

I^{'}

is a type-

β

co-location instance,

I^{'}

cannot be a type-

β

co-location instance either.

Definition 13

(Type-

β

participation ratio). Let

C I s_{β} (p)

be the type-β co-location table instance of a pattern p. The type-β participation ratio of a feature

f_{i}

in p is the ratio of the sum of the closeness centralities of

f_{i}

’s instances appearing in

C I s_{β} (p)

to instances of

f_{i}

. Namely,

\begin{matrix} P R_{β} (p, f_{i}) = & \frac{\sum_{i_{u} \in I^{'}, I^{'} \in C I s_{β} (p), f e a (i_{u}) = f_{i}} C C (I^{'}, i_{u})}{∣ {i_{u} ∣ i_{u} \in I \land i_{u} \notin \cup C I s_{β} (p) \land f e a (i_{u}) = f_{i}} ∣ + ∣ C I s_{β} (p) ∣} \end{matrix}

(16)

Example 10.

P R_{β} ({A, B, C, D, E}, E) = \frac{4 / 7 + 4 / 6 + 4 / 6}{3} = 40 / 63

in Figure 2b while

β \leq 4 / 7

. It is different from

P R ({A, B, C, D, E}, E) = 0

in Definition 3.

The simpler form

P R_{β} (p, f_{i}) = \frac{\sum_{i_{u} \in \cup C I s_{β} (p) \land f e a (i_{u}) = f_{i}} C C (I^{'}, i_{u})}{∣ {i_{u} ∣ i_{u} \in I \land f e a (i_{u}) = f_{i}} ∣}

is not used because an instance may appear repeatedly in different type-

β

co-location instances.

Definition 14

(Type-

β

participation index). Given a pattern p (

p \subseteq F

), its type-β participation index is the minimum type-β participation ratio of features in p. Namely,

P I_{β} (p) = m i n_{f_{i} \in p} P R_{β} (p, f_{i})

(17)

Example 11.

P I_{β} ({A, B, C, D, E}) = m i n (\frac{4 / 5 + 4 / 5 + 4 / 5}{3}, \frac{4 / 5 + 4 / 4 + 4 / 5}{3}, \frac{4 / 5 + 4 / 5 + 4 / 4}{3}, \frac{4 / 4 + 4 / 4 + 4 / 4}{3}, \frac{4 / 7 + 4 / 6 + 4 / 6}{3}) = 40 / 63

in Figure 2b while

β \leq 4 / 7

. It is different from

P I ({A, B, C, D, E}) = 0

in Definition 3.
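Definitions 13 and 14 reduce to simple arithmetic once the table instance is known. The sketch below recomputes Example 11 with exact fractions. It assumes, as the denominators of 3 in Example 11 suggest, that the table instance has three rows and that no instance of any feature is left outside it; the function names and the input layout (one list of closeness centrality values per feature, one value per table row) are hypothetical.

```python
from fractions import Fraction as Fr

def participation_ratio(cc_values, n_absent, n_rows):
    """Equation (16): summed closeness centralities of a feature's instances,
    divided by (non-participating instances + rows of the table instance)."""
    return sum(cc_values, Fr(0)) / (n_absent + n_rows)

def participation_index(cc_by_feature, n_absent_by_feature, n_rows):
    """Equation (17): minimum participation ratio over the features of p."""
    return min(
        participation_ratio(cc, n_absent_by_feature.get(f, 0), n_rows)
        for f, cc in cc_by_feature.items()
    )

# Closeness centrality values per feature, copied from Example 11.
cc_by_feature = {
    "A": [Fr(4, 5), Fr(4, 5), Fr(4, 5)],
    "B": [Fr(4, 5), Fr(4, 4), Fr(4, 5)],
    "C": [Fr(4, 5), Fr(4, 5), Fr(4, 4)],
    "D": [Fr(4, 4), Fr(4, 4), Fr(4, 4)],
    "E": [Fr(4, 7), Fr(4, 6), Fr(4, 6)],
}
pi_beta = participation_index(cc_by_feature, {}, n_rows=3)
```

Under these assumptions `pi_beta` equals 40/63, and the minimum is attained by feature E.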

Definition 15

(Type-

β

co-location pattern). Given a pattern p (

p \subseteq F

) and a prevalence threshold ζ, if and only if

P I_{β} (p) \geq ζ

, we call p a type-β co-location pattern. The set composed of all type-β co-location patterns is denoted

C P s_{β}

. Namely,

P I_{β} (p) \geq ζ \Leftrightarrow p \in C P s_{β}

(18)

Example 12.

{A, B, C, D, E} is a type-β co-location pattern when

β \leq 4 / 7 \land ζ \leq 40 / 63

, but it is not a co-location pattern in Definition 4 when

m i n_p i > 0

in Figure 2b.

Theoretically, any subset of F with at least two features can be a candidate type-

β

co-location pattern. Checking all

2^{∣ F ∣} - ∣ F ∣ - 1

candidate patterns in turn, each with its type-

β

co-location instance generation, is time-consuming. In contrast to Lemma 1, the Apriori property is not satisfied by type-

β

co-location patterns because Lemma 2 guarantees only partial antimonotonicity. Thus, we first introduce the approximate type-

β

co-location pattern to avoid combinatorial explosion, and then establish a new property in Theorem 2.

Definition 16

(Approximate type-

β

co-location pattern). Given a pattern p (

p \subseteq F

) and a prevalence threshold ζ, if its participation index is greater than or equal to ζ when its type-β co-location instances are relaxed to instance sets in which every pair of instances can reach each other within

⌈ \frac{2}{β} ⌉ - 1

steps, p is called an approximate type-β co-location pattern. The set composed of approximate type-β co-location patterns is denoted by

A P s_{β}

. Namely,

m i n_{f_{i} \in p} {\frac{∣ {i_{u} ∣ i_{u} \in \cup U I s_{β} (p) \land f e a (i_{u}) = f_{i}} ∣}{∣ {i_{u} ∣ i_{u} \in I \land f e a (i_{u}) = f_{i}} ∣}} \geq ζ \Leftrightarrow p \in A P s_{β}

(19)

where

U I s_{β} (p) = {I^{'} ∣ (\forall i_{v} \in I^{'}) (\forall i_{w} \in I^{'}) (I^{'} \subseteq I \land {f e a (i_{s}) ∣ i_{s} \in I^{'}} = p \land ∣ I^{'} ∣ = ∣ p ∣ \land ∣ s p (i_{v}, i_{w}) ∣ \leq ⌈ \frac{2}{β} ⌉ - 1)}

.

Example 13.

{A, B, C, D, E, F} \in A P s_{β}

when

β \leq 4 / 7

and

ζ \leq 2 / 3

, but

P I_{β} ({A, B, C, D, E, F})

= 5 / 21

.

Theorem 2

(Downward closure of approximate type-

β

co-location pattern). Given a subset p of the feature set F (

p \subseteq F

), if p is an approximate type-β co-location pattern, any subset of p must also be an approximate type-β co-location pattern. Namely,

\begin{matrix} (\forall p \subseteq F) (\forall p^{'} \subseteq p) (p \in A P s_{β} \to p^{'} \in A P s_{β}) \end{matrix}

(20)

Proof of Theorem 2.

Given two subsets p and

p^{'}

of the feature set F (

p^{'} \subset p \subseteq F

), let

f_{i}

be a feature in

p^{'}

(

f_{i} \in p^{'} \land f_{i} \in p

). Assuming

I^{'}

is any instance subset satisfying

(\forall i_{v} \in I^{'}) (\forall i_{w} \in I^{'}) (I^{'} \subseteq I \land {f e a (i_{s}) ∣ i_{s} \in I^{'}} = p \land ∣ I^{'} ∣ = ∣ p ∣ \land ∣ s p (i_{v}, i_{w}) ∣ \leq ⌈ \frac{2}{β} ⌉ - 1)

, it follows that any subset

I^{″}

of

I^{'}

whose corresponding feature set is

p^{'}

satisfies

(\forall i_{v} \in I^{″}) (\forall i_{w} \in I^{″}) (I^{″} \subseteq I^{'} \land {f e a (i_{s}) ∣ i_{s} \in I^{″}} = p^{'} \land ∣ I^{″} ∣ = ∣ p^{'} ∣ \land ∣ s p (i_{v}, i_{w}) ∣ \leq ⌈ \frac{2}{β} ⌉ - 1)

. That is to say, if an instance

i_{u}

whose feature is

f_{i}

appears in

I^{'}

, it also must appear in

I^{″}

. Thus,

\frac{∣ {i_{u} ∣ i_{u} \in \cup U I s_{β} (p) \land f e a (i_{u}) = f_{i}} ∣}{∣ {i_{u} ∣ i_{u} \in I \land f e a (i_{u}) = f_{i}} ∣} \leq \frac{∣ {i_{u} ∣ i_{u} \in \cup U I s_{β} (p^{'}) \land f e a (i_{u}) = f_{i}} ∣}{∣ {i_{u} ∣ i_{u} \in I \land f e a (i_{u}) = f_{i}} ∣}

, and then

\frac{∣ {i_{u} ∣ i_{u} \in \cup U I s_{β} (p) \land f e a (i_{u}) = f_{i}} ∣}{∣ {i_{u} ∣ i_{u} \in I \land f e a (i_{u}) = f_{i}} ∣} \geq ζ \Rightarrow \frac{∣ {i_{u} ∣ i_{u} \in \cup U I s_{β} (p^{'}) \land f e a (i_{u}) = f_{i}} ∣}{∣ {i_{u} ∣ i_{u} \in I \land f e a (i_{u}) = f_{i}} ∣} \geq ζ

. A more detailed proof can be modeled on the proof of the antimonotonicity of participation ratios and participation indexes in [31].

Furthermore,

\begin{matrix} m i n_{f_{i} \in p^{'}} {\frac{∣ {i_{u} ∣ i_{u} \in \cup U I s_{β} (p^{'}) \land f e a (i_{u}) = f_{i}} ∣}{∣ {i_{u} ∣ i_{u} \in I \land f e a (i_{u}) = f_{i}} ∣}} \\ \geq & m i n {m i n_{f_{i} \in p^{'}} {\frac{∣ {i_{u} ∣ i_{u} \in \cup U I s_{β} (p^{'}) \land f e a (i_{u}) = f_{i}} ∣}{∣ {i_{u} ∣ i_{u} \in I \land f e a (i_{u}) = f_{i}} ∣}}, m i n_{f_{j} \in p - p^{'}} {\frac{∣ {i_{u} ∣ i_{u} \in \cup U I s_{β} (p) \land f e a (i_{u}) = f_{j}} ∣}{∣ {i_{u} ∣ i_{u} \in I \land f e a (i_{u}) = f_{j}} ∣}}} \\ \geq & m i n {m i n_{f_{i} \in p^{'}} {\frac{∣ {i_{u} ∣ i_{u} \in \cup U I s_{β} (p) \land f e a (i_{u}) = f_{i}} ∣}{∣ {i_{u} ∣ i_{u} \in I \land f e a (i_{u}) = f_{i}} ∣}}, m i n_{f_{j} \in p - p^{'}} {\frac{∣ {i_{u} ∣ i_{u} \in \cup U I s_{β} (p) \land f e a (i_{u}) = f_{j}} ∣}{∣ {i_{u} ∣ i_{u} \in I \land f e a (i_{u}) = f_{j}} ∣}}} \\ = & m i n_{f_{i} \in p} {\frac{∣ {i_{u} ∣ i_{u} \in \cup U I s_{β} (p) \land f e a (i_{u}) = f_{i}} ∣}{∣ {i_{u} ∣ i_{u} \in I \land f e a (i_{u}) = f_{i}} ∣}} \end{matrix}

Namely,

\begin{matrix} m i n_{f_{i} \in p} {\frac{∣ {i_{u} ∣ i_{u} \in \cup U I s_{β} (p) \land f e a (i_{u}) = f_{i}} ∣}{∣ {i_{u} ∣ i_{u} \in I \land f e a (i_{u}) = f_{i}} ∣}} \geq ζ \land p^{'} \subseteq p \\ \Rightarrow & m i n_{f_{i} \in p^{'}} {\frac{∣ {i_{u} ∣ i_{u} \in \cup U I s_{β} (p^{'}) \land f e a (i_{u}) = f_{i}} ∣}{∣ {i_{u} ∣ i_{u} \in I \land f e a (i_{u}) = f_{i}} ∣}} \geq ζ \end{matrix}

Thus,

(\forall p \subseteq F) (\forall p^{'} \subseteq p) (p \in A P s_{β} \to p^{'} \in A P s_{β})

□

Example 14.

{A, B, C, D, E, F} \in A P s_{β}

when

β \leq 4 / 7

and

ζ \leq 2 / 3

, and so is any subset of

{A, B, C, D, E, F}

.

Comparing Definitions 15 and 16, Theorem 2 implies that a pattern can be a type-

β

co-location pattern only if all of its subsets are approximate type-

β

co-location patterns, since the closeness centrality of an instance lies between 0 and 1. At this point, the type-

β

co-location pattern mining problem can be transformed into the classical co-location pattern mining problem. Therefore, the majority of traditional co-location pattern mining algorithms such as Join-based, Join-less [32], CPI-tree, and so on can be improved to mine type-

β

co-location patterns.

For a type-

β

co-location instance, each instance has closeness centrality. Accordingly, for a type-

β

co-location pattern, each feature has closeness centrality.

Definition 17

(Closeness centrality of

f_{i}

(

f_{i} \in p, p \subseteq F

)). Given a type-β co-location pattern p, let

f_{i}

be any feature in p. The closeness centrality of

f_{i}

in p is the average closeness centrality of instances carrying

f_{i}

in its type-β co-location instances. Namely

C C F (p, f_{i}) = \frac{\sum_{i_{u} \in I^{'}, I^{'} \in C I s_{β} (p), f e a (i_{u}) = f_{i}} C C (I^{'}, i_{u})}{∣ C I s_{β} (p) ∣}

(21)

Example 15.

C C F ({A, B, C, D, E}, A) = 4 / 5 = 0.8

,

C C F ({A, B, C, D, E}, B) = 13 / 15 \approx 0.87

,

C C F ({A, B, C, D, E}, C) = 13 / 15 \approx 0.87

,

C C F ({A, B, C, D, E}, D) = 1

, and

C C F ({A, B, C, D, E}, E) = 40 / 63 \approx 0.63

when

β \leq 4 / 7

and

ζ \leq 40 / 63

in Figure 2b.

The denominator is

∣ C I s_{β} (p) ∣

rather than the seemingly more intuitive

∣ {i_{u} ∣ (\exists s \in C I s_{β} (p)) (i_{u} \in s, f e a (i_{u}) = f_{i})} ∣

because an instance

i_{u}

may appear in different type-

β

co-location instances of p.

Since the closeness centrality of every instance in a type-

β

co-location pattern instance is greater than or equal to

β

according to Definition 12, the closeness centrality of every feature in a type-

β

co-location pattern must also be greater than or equal to

β

according to Definition 17. Furthermore, the closeness centrality of each feature in a type-

β

co-location pattern may be different. That is to say, in a type-

β

co-location pattern, the correlation between one feature and the others may differ. Some features may lie at the topological center of the pattern while others lie on its edge. Features with high closeness centrality are more strongly correlated with the other features. Obviously, the closer

C C F (p, f_{i})

is to 1, the more strongly

f_{i}

is correlated with the other features in p when the distribution of the whole data set is taken into account.

The closeness centrality of instances in an instance subset

I^{'}

can be different even if

I^{'}

is a clique. Therefore, Definition 17 can be extended, as in Definition 18 below; this extension does not apply to Definition 12. In this paper, we focus on Definition 17 rather than Definition 18.

Definition 18

(Extended closeness centrality of

f_{i}

(

f_{i} \in p, p \subseteq F

)). Given a type-β co-location pattern p, let

f_{i}

be any feature in p. The extended closeness centrality of

f_{i}

in p is the average distance-based closeness centrality of instances carrying

f_{i}

in its type-β co-location instances. Namely,

E C C F (p, f_{i}) = \frac{\sum_{i_{u} \in I^{'}, I^{'} \in C I s_{β} (p), f e a (i_{u}) = f_{i}} E C C (I^{'}, i_{u})}{∣ C I s_{β} (p) ∣}

(22)

where

E C C (I^{'}, i_{u}) = \frac{m i n_w e i g h t s (I^{'})}{\sum_{i_{v} \in I^{'}} d i s (i_{u}, i_{v})}

and

m i n_w e i g h t s (I^{'})

is the sum of the edge weights in the minimum spanning tree [33] of

G^{'} = (I^{'}, {(i_{u}, i_{v}) ∣ i_{u} \in I^{'} \land i_{v} \in I^{'}}, {d i s (i_{u}, i_{v}) ∣ i_{u} \in I^{'} \land i_{v} \in I^{'}})

.

Example 16.

E C C F ({A, B, C}, A) = \frac{(3 + 3.2) / (3 + 3.2) + (5.6 + 5.1) / (5.6 + 5.1) + (11.2 + 11.6) / (11.6 + 16.2)}{3} \approx 0.94

,

E C C F ({A, B, C}, B) = \frac{(3 + 3.2) / (3 + 3.6) + (5.6 + 5.1) / (5.6 + 5.8) + (11.2 + 11.6) / (11.6 + 11.2)}{3} \approx 0.96

,

E C C F ({A, B, C}, C) = \frac{(3 + 3.2) / (3.2 + 3.6) + (5.6 + 5.1) / (5.1 + 5.8) + (11.2 + 11.6) / (11.2 + 16.2)}{3} \approx 0.91

when

β \leq 1

and

ζ \leq 1

, but

C C F ({A, B, C}, A) = 1

,

C C F ({A, B, C}, B) = 1

,

C C F ({A, B, C}, C) = 1

in Figure 2b.

Obviously,

0 \leq E C C (I^{'}, i_{u}) \leq 1

for any instance

i_{u}

in an instance subset

I^{'}

. Furthermore, Definition 18 normalizes the closeness centrality of all features in their corresponding patterns. That is to say, the closer

E C C F (p, f_{i})

is to 1, the higher the correlation between

f_{i}

and the other features in p when the distributions of type-

β

co-location instances are just taken into account, rather than of all instances.
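The extended closeness centrality ECC used in Definition 18 can be sketched as follows. This is an illustrative sketch only: the minimum spanning tree weight is computed with Prim's algorithm on the complete graph over the instance subset, the function names are hypothetical, and the pairwise distances 3, 3.2, and 3.6 are the ones that appear in the first term of each expression in Example 16 (the instance names are placeholders). The subset is assumed to contain at least two instances.

```python
def mst_weight(nodes, dis):
    """Total edge weight of a minimum spanning tree (Prim's algorithm) of the
    complete graph over nodes, where dis(u, v) gives the edge weight."""
    nodes = list(nodes)
    in_tree = {nodes[0]}
    total = 0.0
    while len(in_tree) < len(nodes):
        # Cheapest edge leaving the current tree.
        weight, nxt = min(
            (dis(u, x), x) for u in in_tree for x in nodes if x not in in_tree
        )
        total += weight
        in_tree.add(nxt)
    return total

def ecc(nodes, dis, u):
    """ECC(I', i_u): MST weight divided by the summed distances from i_u."""
    return mst_weight(nodes, dis) / sum(dis(u, v) for v in nodes if v != u)

# Pairwise distances of one clique instance, taken from Example 16.
pairs = {
    frozenset(("A", "B")): 3.0,
    frozenset(("A", "C")): 3.2,
    frozenset(("B", "C")): 3.6,
}
dis = lambda u, v: pairs[frozenset((u, v))]
values = {u: ecc(["A", "B", "C"], dis, u) for u in "ABC"}
```

Here the spanning tree uses the two shortest edges (weight 6.2), so A obtains the value 1 while B and C obtain 6.2/6.6 and 6.2/6.8, even though all three instances have closeness centrality 1 under Definition 10.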

Since the closeness centrality of each feature in prevalent patterns is expected to be evaluated, an algorithm based on Join-less that searches size by size, rather than by finding maximal patterns [34], is proposed in this paper.

4.2. Type-β Co-Location Pattern Mining

Because Join-less does not take closeness centrality into account, it is extended as follows. (a) Path lengths between instance pairs are computed by a shortest path algorithm, such as Dijkstra's algorithm, on

G = (I, E)

(the output of Algorithm 1); then, if the path length between an instance pair is not greater than

⌈ \frac{2}{β} ⌉ - 1

, an edge is added between the two instances according to Theorem 1. (b) All approximate type-

β

co-location patterns and their instances are obtained from the updated

G = (I, E)

by Join-less according to Theorem 2. (c) The prevalence of each approximate type-

β

co-location pattern is checked on its instances with Theorem 1 and Definition 15. (d) For each type-

β

co-location pattern, the closeness centrality of each feature is computed by Definition 17 or Definition 18.

Algorithm 2 is used to find type-

β

co-location patterns with their feature closeness centrality values. Next, we walk through the algorithm and analyze its time complexity. Step 1 generates a mutual neighbor relationship graph by Algorithm 1. It costs

O (∣ I ∣^{2})

. Step 2 returns the shortest path lengths between each instance pair by the algorithm of Dijkstra. It costs

O (∣ E ∣^{2})

. Steps 3 to 9 update the mutual neighbor relationship graph according to Theorem 1. They cost

O (∣ I ∣^{2})

. Since every instance pair between which the shortest path length is not greater than

⌈ \frac{2}{β} ⌉ - 1

can be detected, no approximate type-

β

co-location pattern is missed in Step 10. Step 10 dominates the time complexity of the whole algorithm. Since Join-less is known to be efficient, so is this algorithm. Steps 11 to 19 derive type-

β

co-location patterns and compute their feature closeness centrality from approximate type-

β

co-location patterns. In Step 15, feature closeness centrality is computed according to Lemma 2. This step costs

O (∣ p ∣^{2})

. These 9 steps cost

O (∣ A P s_{β} ∣ \times \frac{∣ I ∣}{∣ F ∣} \times ∣ p ∣^{2})

on average. The whole algorithm is based on proved lemmas and theorems. Thus, it is correct and complete.

Algorithm 2 Mining type-β co-location patterns (β-CPM)

Require: D (the given data set), F (the feature set), I (the instance set with location information), α (the elastic coefficient to generate mutual neighbor relationships), β (the threshold for closeness centrality), ζ (the given prevalence threshold).

Ensure: CPs_{β} (type-β co-location patterns with closeness centrality of each feature).

1: G = Algorithm 1 (D, α) //Generate spatial mutual neighbor relationship graph.
2: {∣sp(i_{u}, i_{v})∣ ∣ i_{u} \in I \land i_{v} \in I} = Dijkstra(G) //Get shortest path lengths between instance pairs.
3: for i_{u} \in I do
4:   for i_{v} \in I do
5:     if ∣sp(i_{u}, i_{v})∣ \leq ⌈ \frac{2}{β} ⌉ - 1 then
6:       E = E \cup {(i_{u}, i_{v})} //Theorem 1.
7:     end if
8:   end for
9: end for
10: APs_{β}, APs_{β}_ins = Join-less(data = G, min_pi = ζ) //Theorem 2.
11: for p \in APs_{β} do
12:   if check_CPs_{β}(p, APs_{β}_ins) then
13:     p_dict = {}
14:     for f_{i} \in p do
15:       p_dict.update({f_{i} : CCF(p, f_{i})}) //Definition 17/18 and Lemma 2.
16:     end for
17:     CPs_{β} = CPs_{β} \cup {p_dict} //Definition 15.
18:   end if
19: end for
20: return CPs_{β}
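Steps 2 to 9 of Algorithm 2, which relax the mutual neighbor relationship graph before Join-less is applied, can be sketched as below. This is a hedged illustration, not the authors' code: because every edge counts as one step, a depth-limited breadth-first search is substituted here for Dijkstra's algorithm and yields the same hop counts; the function name and the toy path graph are hypothetical, and every node is assumed to appear as a key of the adjacency dictionary.

```python
from collections import deque
from fractions import Fraction
from math import ceil

def relax_graph(adj, beta):
    """Return a new adjacency dict that links every pair of instances whose
    hop distance in adj is at most ceil(2 / beta) - 1 (Theorem 1)."""
    limit = ceil(2 / Fraction(beta)) - 1
    relaxed = {}
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            if dist[u] == limit:
                continue                  # do not expand beyond the hop limit
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        relaxed[source] = set(dist) - {source}
    return relaxed

# Hypothetical path graph A - B - C - D - E.
adj = {"A": {"B"}, "B": {"A", "C"}, "C": {"B", "D"}, "D": {"C", "E"}, "E": {"D"}}
relaxed = relax_graph(adj, Fraction(1, 2))    # hop limit 3
```

With β = 1/2 instance A becomes adjacent to B, C, and D but not to E, which is four steps away. With β = 1 the graph is returned unchanged, so the subsequent search would only consider cliques.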

5. Experiment Analysis

In this section, both synthetic and real data sets are used to design our experiments. The reason for using the synthetic data set is that it is difficult to evaluate what can be the ideal result on the real data set.

Synthetic data set. Figure 4a shows the data set that we generated, referred to as the Synthetic data set. The space is divided into three regions of similar area. Coordinates of instances are randomly assigned in each region following distribution densities of 3:8:13. Moreover, instances carrying features in {A, B, C} are deliberately arranged to cluster together, as are those carrying features in {D, E, F}. The densities of instances of both {A, B, C} and {D, E, F} are higher than their corresponding regional densities (the average distance between neighboring instance pairs is not greater than 6 m, 4 m, and 2 m in the three regions, respectively). In other words, {A, B, C} and {D, E, F} can be interesting patterns, but the others cannot.

The Three Parallel Rivers data set. Figure 4b shows the distribution of rare plants in the Three Parallel Rivers (hereafter the Three Parallel Rivers data set). It is generally sparse, but its distribution density varies greatly.

Table 1 shows the feature and instance distributions. Some key performance indicators are tested as follows.

5.1. Precision, Accuracy, and Recall

The precision, accuracy, and recall are tested on the Synthetic data set only, because they cannot be objectively evaluated on real data sets. In the Synthetic data set, {A, B, C} and {D, E, F} are considered highly correlated, along with their subsets of size greater than 1. In other words, {A, B}, {A, C}, {B, C}, {A, B, C}, {D, E}, {D, F}, {E, F}, and {D, E, F} are regarded as positive patterns, while the other

2^{∣ F ∣} - ∣ F ∣ - 1 - 8

patterns of size greater than 1 are regarded as negative patterns. Based on the mining results, we denote true positive patterns as

T P

, false positive patterns as

F P

, true negative patterns as

T N

, and false negative patterns as

F N

, borrowing definitions from [35]. Accordingly, the precision of our algorithm (Algorithm 2) is defined as

\frac{∣ T P ∣}{∣ T P ∣ + ∣ F P ∣}

and the accuracy as

\frac{∣ T P ∣ + ∣ T N ∣}{∣ T P ∣ + ∣ T N ∣ + ∣ F P ∣ + ∣ F N ∣}

, while the recall is defined as

\frac{∣ T P ∣}{∣ T P ∣ + ∣ F N ∣}

.
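The three measures can be computed directly from the set of mined patterns. The sketch below assumes, purely for illustration, a feature set of six features A to F (the actual feature set is given in Table 1) and a hypothetical mining result that misses {D, E, F} and wrongly reports {C, D}; the function name is also hypothetical.

```python
from itertools import combinations

def evaluate(features, positives, mined):
    """Precision, accuracy, and recall over all patterns of size >= 2."""
    candidates = {frozenset(c)
                  for k in range(2, len(features) + 1)
                  for c in combinations(features, k)}
    positives = {frozenset(p) for p in positives}
    mined = {frozenset(p) for p in mined}
    tp = len(mined & positives)                # true positives
    fp = len(mined - positives)                # false positives
    fn = len(positives - mined)                # false negatives
    tn = len(candidates - positives - mined)   # true negatives
    return {
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }

positives = ["AB", "AC", "BC", "ABC", "DE", "DF", "EF", "DEF"]
mined = ["AB", "AC", "BC", "ABC", "DE", "DF", "EF", "CD"]   # hypothetical result
scores = evaluate("ABCDEF", positives, mined)
```

With six features there are 57 candidate patterns, so this hypothetical result gives a precision and recall of 7/8 and an accuracy of 55/57. The large number of true negatives explains why accuracy can remain high even when precision or recall drops.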

Figure 5 shows the precision, accuracy, and recall of

β

-CPM under different representative parameter settings on the Synthetic data set, in comparison with RCMA [15] and SGCT_K [13]. For RCMA, the precision and recall are difficult to balance, while accuracy is favored. Meanwhile, the balance points of the precision, accuracy, and recall of SGCT_K are difficult for users to find because they have very narrow ranges. Moreover, since participation indexes in SGCT_K are based on kernel functions, the optimal participation index threshold is not easy for users to estimate. Theoretically, all positive patterns should be discovered once

r \geq 6

. However, SGCT_K cannot achieve this because of its kernel functions. In contrast, our

β

-CPM is robust. Setting any one of

α

,

β

, and

ζ

to its optimal value can lead to an optimal result.

Figure 5 also reveals that negative patterns are insensitive to

α

and

β

. That is to say, negative patterns are unlikely to be discovered by

β

-CPM. Furthermore, a lower

β

can make up for the fact that

α

is too small. Conversely, a larger

α

can also make up for the fact that

β

is too big.

5.2. Efficiency

Since spatial data sets are large or even massive, users are also interested in the total time costs of algorithms. Table 2 shows the total time costs of

β

-CPM on the Synthetic data set and the Three Parallel Rivers data set in comparison with RCMA and SGCT_K under their optimal parameters. RCMA is time-consuming because its iterations are based on the dynamic k of KNN. If the data set is distributed evenly, it must perform global co-location pattern mining. In other words, it must repeatedly mine patterns to check the similarity of the regions to be connected. Thus, it is suited to finding regional co-location patterns rather than global ones when distribution densities differ. SGCT_K focuses on the influence of distance on proximity; that is, it considers closer instances to be more important. It neglects the influence of local distribution densities on proximity. In short, its focus differs from the direction pursued by our

β

-CPM in this article.

β

-CPM is more efficient than the other two with similar pattern results because it detects valuable neighbors.
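The gap reported in Table 2 can be expressed as time ratios relative to β-CPM. The short script below is only an arithmetic illustration: the figures are copied from Table 2, while the dictionary layout and the helper name `speedups` are illustrative choices.

```python
# Total time costs in microseconds, copied from Table 2.
TIME_US = {
    "Synthetic": {
        "RCMA": 3_492_173_294,
        "SGCT_K": 6_408_119,
        "beta-CPM": 208_859,
    },
    "Three Parallel Rivers": {
        "RCMA": 1.133e11,
        "SGCT_K": 1_129_484,
        "beta-CPM": 136_428,
    },
}


def speedups(times, baseline="beta-CPM"):
    """Ratio of each algorithm's time cost to the baseline's time cost."""
    base = times[baseline]
    return {name: t / base for name, t in times.items() if name != baseline}


ratios = {data_set: speedups(times) for data_set, times in TIME_US.items()}
for data_set, per_algorithm in ratios.items():
    for name, ratio in per_algorithm.items():
        print(f"{data_set}: {name} takes about {ratio:,.1f}x the time of beta-CPM")
```

With these figures, SGCT_K takes roughly 31 times (Synthetic) and 8 times (Three Parallel Rivers) as long as β-CPM, and RCMA takes several orders of magnitude longer.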

5.3. Density Response

Figure 6 shows the distance distributions of neighbor pairs found by β-CPM in the Synthetic data set and the Three Parallel Rivers data set in comparison with those found by SGCT_K. Since RCMA focuses on regional co-location patterns, we do not compare β-CPM with it here.

There are almost no singularities in the boxplots of SGCT_K in either Figure 6a or Figure 6b. The reason is that neighbors are defined by a global static distance threshold, while distances between instance pairs tend to be normally distributed. However, users care primarily about prevalent patterns and may not care about the particular neighbor relationships of instances. In other words, there are too many neighbor pairs. More importantly, SGCT_K neglects the influence of distribution densities under Tobler’s First Law of Geography.

β-CPM discovers fewer neighbor pairs, but it still finds interesting patterns. The singularities in its boxplots reveal that there are some sparse areas in the space. β-CPM thus responds effectively to the influence of regional distribution density under Tobler’s First Law of Geography.

5.4. Feature Closeness Centrality

In a traditional prevalent pattern, features have only participation ratios but no closeness centrality values. Figure 7 shows example dendrograms of feature closeness centrality in the Synthetic data set and the Three Parallel Rivers data set. It reveals that the antimonotonicity of approximate type-β co-location patterns may sometimes lead to the antimonotonicity of type-β co-location patterns. Furthermore, different features may have different closeness centrality values. For example, A has a higher closeness centrality than B in {A, B}, as shown in Figure 7, when α = 2, β = 0.3, and ζ = 0.5.
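As a reminder of the underlying quantity, the sketch below computes closeness centrality for each object of one small pattern instance, together with the minimum over the instance (MCC in the Abbreviations). It is only an illustration under stated assumptions: it uses the textbook definition, (n − 1) divided by the sum of shortest-path lengths, with unweighted hop counts, which may differ from the exact definition given earlier in this paper, and the instance graph and all names are hypothetical.

```python
from collections import deque


def hop_distances(adj, source):
    """Shortest-path lengths (in hops) from source, by breadth-first search."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def closeness_centrality(adj, node):
    """Textbook closeness: (n - 1) / sum of shortest-path lengths.

    Returns 0.0 when the graph is trivial or some object is unreachable.
    """
    dist = hop_distances(adj, node)
    if len(adj) < 2 or len(dist) < len(adj):
        return 0.0
    return (len(adj) - 1) / sum(dist.values())


# Hypothetical pattern instance: a path A1 - B1 - C1 - D1 in the neighbor graph.
instance = {
    "A1": {"B1"},
    "B1": {"A1", "C1"},
    "C1": {"B1", "D1"},
    "D1": {"C1"},
}

cc = {node: closeness_centrality(instance, node) for node in instance}
mcc = min(cc.values())  # minimum closeness centrality of the instance
print(cc, mcc)
```

Here the inner objects B1 and C1 obtain higher values than the end objects A1 and D1, which is the kind of difference between features that the dendrograms visualize.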

To sum up, β-CPM is workable in terms of precision, accuracy, recall, efficiency, density response, and feature closeness centrality expression.

6. Conclusions and Discussion

In this paper, type-β co-location patterns are discovered in an innovative way. Firstly, spatial mutual neighbor relationships are generated on the idea that the distances between an instance and its neighbors are similar rather than remarkably different. This method adapts to different distribution densities across regions and data sets. It yields fewer neighbor pairs but responds effectively to users' interest in the expected positive patterns. Secondly, instances of interesting patterns are defined on closeness centrality instead of on cliques or stars. This extends the forms that pattern instances can take and counteracts the adverse effects of an α that is too small. Users can flexibly set β according to their own preferences to control the strength of the correlation among instances. Thirdly, the closeness centrality of each feature in an interesting pattern is measured by the closeness centrality of the objects in the instances of that pattern. It can reveal the correlations between each feature and the others in a type-β co-location pattern. A feature with higher closeness centrality condenses the type-β co-location pattern more.
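The first point can be illustrated with a small brute-force sketch. It assumes that the inside radius (IR) of an instance is the distance to its nearest neighbor, that the outer radius is α times the IR, that directed neighbors are the instances within the outer radius, and that two instances are mutual neighbors when each is a directed neighbor of the other. These readings are inferred from the Abbreviations and the caption of Figure 2 and may not match the paper's definitions in every detail; the function name, the coordinates, and the brute-force search (instead of a KD-tree) are illustrative choices.

```python
import math


def mutual_neighbors(points, alpha):
    """Return mutual neighbor pairs for points given as {name: (x, y)}."""
    names = list(points)
    directed = {}
    for p in names:
        dists = {q: math.dist(points[p], points[q]) for q in names if q != p}
        inside_radius = min(dists.values())   # assumed: nearest-neighbor distance
        outer_radius = alpha * inside_radius  # assumed: outer radius limited by alpha
        directed[p] = {q for q, d in dists.items() if d <= outer_radius}
    # Keep a pair only when the directed relationship holds in both directions.
    return {
        frozenset((p, q))
        for p in names
        for q in directed[p]
        if p in directed[q]
    }


# Hypothetical toy data: a dense group (A1, B1, C1) and a sparse group (A2, B2, C2).
points = {
    "A1": (0.0, 0.0), "B1": (1.0, 0.0), "C1": (0.0, 1.5),
    "A2": (20.0, 0.0), "B2": (26.0, 0.0), "C2": (20.0, 9.0),
}

pairs = mutual_neighbors(points, alpha=2)
for pair in sorted(sorted(p) for p in pairs):
    print(pair)
```

In this toy data, both the dense group and the sparse group are fully connected internally and no pair crosses the two groups, whereas a single global distance threshold would have to be chosen differently for each group.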

The same feature can have different closeness centrality values in different type-β co-location patterns. Therefore, exploring how the closeness centrality of features changes as pattern size grows can further reveal the interaction mechanism among spatial features. In addition, as the pattern size grows, some features that previously had higher closeness centrality suddenly drop to a minimum. An in-depth analysis of the reasons for this can help us further understand how the patterns survive and how the correlations among features sustain them. All of the above will be our main future work.

Author Contributions

Conceptualization, Muquan Zou and Lizhen Wang; methodology, Muquan Zou; software, Muquan Zou; validation, Muquan Zou, Lizhen Wang and Pingping Wu; formal analysis, Muquan Zou; investigation, Muquan Zou; resources, Vanha Tran; data curation, Pingping Wu; writing—original draft preparation, Muquan Zou; writing—review and editing, Lizhen Wang; visualization, Muquan Zou; supervision, Lizhen Wang; project administration, Lizhen Wang; funding acquisition, Lizhen Wang. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China, grant numbers 61966036 and 62066023. It was also funded by the Project of Innovative Research Team of Yunnan Province, grant number 2018HC019; the Research Project of Kunming University, grant number XJZZ1706; and the Li Zhengqiang Expert Workstation of Yunnan Province, grant number 202205AF150031.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:

KNN	k-nearest neighbors
CIs	Co-location table instance
PI	Participation index
IR	Inside radius
$OR_{α}$	Outer radius limited by $α$
$DNs_{α}$	Directed neighbors limited by $α$
$MNs_{α}$	Mutual neighbors limited by $α$
CC	Closeness centrality
MCC	The minimum closeness centrality
$CIs_{β}$	Type-$β$ co-location instance
$PR_{β}$	Type-$β$ participation ratio
$PI_{β}$	Type-$β$ participation index
$CPs_{β}$	Type-$β$ co-location patterns
$APs_{β}$	Approximate type-$β$ co-location patterns
CCF	Closeness centrality of a feature
ECCF	Extended closeness centrality of a feature
$β$-CPM	The algorithm for mining type-$β$ co-location patterns
RCMA	The regional co-location mining algorithm
SGCT_K	A sparse-graph and condensed tree-based maximal co-location algorithm with a Kernel function
TP	True positive pattern set
FP	False positive pattern set
TN	True negative pattern set
FN	False negative pattern set

References

1. Wang, X.; Lei, L.; Wang, L.; Yang, P.; Chen, H. Spatial co-location pattern discovery Incorporating Fuzzy Theory. IEEE Trans. Fuzzy Syst. 2021, 30, 2055–2072.
2. Wang, L.Z.; Fang, Y.; Zhou, L. Preference-Based Spatial Co-Location Pattern Mining; Big Data Management Series; Springer: Singapore, 2022.
3. Darwin, C. The Origin of Species; Manchester University Press: Manchester, UK; New York, NY, USA, 1998.
4. Li, J.; Adilmagambetov, A.; Jabbar, M.M.; Zane, O.R.; Osornio-Vargas, A.; Wine, O. On discovering co-Location patterns in datasets: A case study of pollutants and child cancers. Geoinformatica 2016, 20, 651–692.
5. Tran, V.; Wang, L.; Chen, H.; Xiao, Q. MCHT: A maximal clique and hash table-based maximal prevalent co-location pattern mining algorithm. Expert Syst. Appl. 2021, 175, 114830–114850.
6. Wang, L.; Bao, X.; Zhou, L.; Chen, H. Mining maximal sub-prevalent co-location patterns. World Wide Web 2019, 22, 1971–1997.
7. Sundaram, V.M.; Thnagavelu, A.; Paneer, P. Discovering co-location patterns from spatial domain using a delaunay approach. Procedia Eng. 2012, 38, 2832–2845.
8. Hu, Z.; Wang, L.; Tran, V.; Chen, H. Efficiently mining spatial co-location patterns utilizing fuzzy grid cliques. Inf. Sci. 2022, 592, 361–388.
9. Zhang, X.; Zhu, J.; Wang, Q.; Zhao, H. Identifying influential nodes in complex networks with community structure. Knowl. Based Syst. 2013, 42, 74–84.
10. Huang, Y.; Xiong, H.; Shekhar, S.; Pei, J. Mining confident co-location rules without a support threshold. In Proceedings of the 2003 ACM Symposium, San Diego, CA, USA, 10–12 June 2003; pp. 497–502.
11. Batal, I.; Hauskrecht, M. A concise representation of association rules using minimal predictive rules. In Proceedings of the Machine Learning and Knowledge Discovery in Databases ECML PKDD 2010, Berlin/Heidelberg, Germany, 20–24 September 2010; pp. 87–102.
12. Huang, Y.; Shekhar, S.; Xiong, H. Discovering colocation patterns from spatial data sets: A general approach. IEEE Trans. Knowl. Data Eng. 2004, 16, 1472–1485.
13. Yao, X.; Chen, L.; Peng, L.; Chi, T. A co-location pattern-mining algorithm with a density-weighted distance thresholding consideration. Inf. Sci. 2017, 396, 144–161.
14. Zhao, J.; Wang, L.; Bao, X.; Tan, Y. Mining co-location patterns with spatial distribution characteristics. In Proceedings of the 2016 International Conference on Computer, Information and Telecommunication Systems (CITS), Kunming, China, 6–8 July 2016; pp. 1–5.
15. Feng, Q.; Chiew, K.; He, Q.; Huang, H. Mining regional co-location patterns with KNNG. J. Intell. Inf. Syst. 2014, 42, 485–505.
16. Tran, V.; Wang, L.; Chen, H. A spatial co-location pattern mining algorithm without distance thresholds. In Proceedings of the 2019 IEEE International Conference on Big Knowledge (ICBK), Beijing, China, 10–11 November 2019; pp. 242–249.
17. Wang, J.; Wang, L.; Wang, X. Mining prevalent co-Location patterns based on global topological relations. In Proceedings of the 2019 20th IEEE International Conference on Mobile Data Management (MDM), Hong Kong, China, 10–13 June 2019; pp. 210–215.
18. Yao, X.; Wang, D.; Peng, L.; Chi, T. An adaptive maximal co-location mining algorithm. In Proceedings of the 2017 IEEE International Geoscience and Remote Sensing Symposium (IGARSS), Fort Worth, TX, USA, 23–28 July 2017; pp. 5551–5554.
19. Tantrum, J.; Murua, A.; Stuetzle, W. Hierarchical model-based clustering of large datasets through fractionation and refractionation. Inf. Syst. 2004, 29, 315–326.
20. Zhou, G.; Li, Q.; Deng, G.; Yue, T.; Zhou, X. Mining co-location patterns with clustering items from spatial data sets. ISPRS Int. Arch. Photogramm. Remote Sens. Spat. Inf. Sci. 2018, XLII-3, 2505–2509.
21. Qian, F.; Yin, L.; He, Q.; He, J. Mining spatio-temporal co-location patterns with weighted sliding window. In Proceedings of the 2009 IEEE International Conference on Intelligent Computing and Intelligent Systems, Shanghai, China, 20–22 November 2009; Volume 3, pp. 181–185.
22. Tang, M.; Wang, Z. Research of spatial co-location pattern mining based on segmentation threshold weight for big dataset. In Proceedings of the 2015 2nd IEEE International Conference on Spatial Data Mining and Geographical Knowledge Services (ICSDM), Fuzhou, China, 8–10 July 2015; pp. 49–54.
23. Dai, B.R.; Lin, M.Y. Efficiently mining dynamic zonal co-location patterns based on maximal co-locations. In Proceedings of the 2011 IEEE 11th International Conference on Data Mining Workshops, Vancouver, BC, Canada, 11 December 2011; pp. 861–868.
24. Agarwal, P.; Verma, R.; Gunturi, V.M.V. Discovering spatial regions of high correlation. In Proceedings of the 2016 IEEE 16th International Conference on Data Mining Workshops (ICDMW), Barcelona, Spain, 12–15 December 2016; pp. 1082–1089.
25. Zeng, X.; Li, Z.; Wang, J.; Li, X. High utility co-location patterns mining from spatial dataset with time interval. In Proceedings of the 2019 IEEE 4th International Conference on Image, Vision and Computing (ICIVC), Xiamen, China, 5–7 July 2019; pp. 628–636.
26. Yang, P.; Wang, L.; Wang, X.; Fang, D. An effective approach on mining co-Location patterns from spatial databases with rare features. In Proceedings of the 2019 20th IEEE International Conference on Mobile Data Management (MDM), Hong Kong, China, 10–13 June 2019; pp. 53–62.
27. Chan, H.K.H.; Long, C.; Yan, D.; Wong, R.C.W. Fraction-score: A new support measure for co-location pattern mining. In Proceedings of the 2019 IEEE 35th International Conference on Data Engineering (ICDE), Macao, China, 8–11 April 2019; pp. 1514–1525.
28. Fang, Y.; Wang, L.; Zhou, L. Mining spatial co-location patterns with key features. J. Data Acquis. Process. 2018, 33, 692–703.
29. Hou, W.; Li, D.; Xu, C.; Zhang, H.; Li, T. An advanced k nearest neighbor classification algorithm based on KD-tree. In Proceedings of the 2018 IEEE International Conference of Safety Produce Informatization (IICSPI), Chongqing, China, 10–12 December 2018; pp. 902–905.
30. Shee, S.C. Tabular algorithms for the shortest path and longest path. Nanta Math. 1977, 10, 100–105.
31. Shekhar, S.; Huang, Y. Discovering spatial co-location patterns: A summary of results. Lect. Notes Comput. Sci. 2001, 2121, 236–256.
32. Yoo, J.S.; Shekhar, S. A joinless approach for mining Spatial colocation patterns. IEEE Trans. Knowl. Data Eng. 2006, 18, 1323–1337.
33. Graham, R.L.; Hell, P. On the history of the minimum spanning tree problem. Ann. Hist. Comput. 1985, 7, 43–57.
34. Wang, L.; Bao, X.; Chen, H.; Cao, L. Effective lossless condensed representation and discovery of spatial co-location patterns. Inf. Sci. 2018, 436, 197–213.
35. Buckland, M.K.; Gey, F.C. The relationship between Recall and Precision. J. Assoc. Inf. Sci. Technol. 2010, 45, 12–19.

Figure 1. This is a toy example of spatial data sets. Each letter represents a feature, and the number following it denotes an instance of that feature. The coordinates are in meters. (a) The neighbor relationship graph generated with a distance threshold d = 3 m. (b) The neighbor relationship graph generated with a distance threshold d = 13 m. (c) The neighbor relationship graph on mutual-KNN when k = 3. (d) The neighbor relationship graph on Delaunay triangulation.

Figure 2. This is an improved neighbor relationship graph of Figure 1. (a) Directed neighbor relationships are generated on the nearest neighbors and the given coefficient α = 2. (b) The directed graph is converted into an undirected graph on mutual neighbors. It adapts to the different distribution densities.

Figure 3. This is a graph for Lemma 2. (a) $MCC(I'') \geq MCC(I')$ if the longest shortest path in $I'$ is $sp(i_{u}, i_{v})$ and $MCC(I'') = CC(I'', i_{u})$. (b) $MCC(\{A3, B3, D3\}) \geq MCC(\{A3, B3, C3, D3\})$ but $MCC(\{A3, B3, C3\}) < MCC(\{A3, B3, C3, D3\})$.

Figure 4. These are the distributions of the Synthetic data set and the Three Parallel Rivers data set. (a) A, B, and C are highly correlated to each other in all three density areas, and so are D, E, and F. (b) Different densities are distributed in the data set. The default unit of distance is meters in Figure 4a,b, as well as for all distance thresholds in this section.

Figure 5. The precision, accuracy, and recall of different algorithms with some parameters on the Synthetic data set. The tests are based on β-CPM in comparison with RCMA and SGCT_K.

Figure 6. The distance distributions of neighbor pairs in the given data sets with β-CPM in comparison with SGCT_K. (a) The distances are lower and more differentiated with β-CPM than with SGCT_K in the Synthetic data set. (b) The same holds in the Three Parallel Rivers data set.

Figure 7. The dendrograms of type-β co-location pattern examples in the given data sets. (a) The closeness centrality values of the features are similar to each other in the given examples with β-CPM (α = 2, β = 0.3, ζ = 0.5). (b) The closeness centrality of f2 is obviously lower than those of f0, f1, and f3, so f2 is of secondary importance in {f0, f1, f2, f3} with β-CPM (α = 3, β = 0.5, and ζ = 0.3). Furthermore, the features are f3, f0, f1, and f2 in order of dominance in {f0, f1, f2, f3}.

Table 1. The distributions of features and instances in the experimental data sets.

Data Sets	Feature Count	Instance Count	Distribution Densities
Synthetic data set	16	1515	Density ratios of 3:8:13
The Three Parallel Rivers data set	31	337	More different densities

Table 2. The total time costs (microseconds) of the algorithms with ideal parameters.

Algorithms	Synthetic Data Set	The Three Parallel Rivers Data Set
RCMA	3,492,173,294 ($θ$ = 0.5, $ϵ$ = 0.5, $α$ = 0.05)	$1.133 \times 10^{11}$ ($θ$ = 0.5, $ϵ$ = 0.2, $α$ = 0.05)
SGCT_K	6,408,119 (r = 12, min_preKDE = 0.3)	1,129,484 (r = 8000, min_preKDE = 0.3)
$β$-CPM	208,859 ($α$ = 2, $β$ = 0.3, $ζ$ = 0.5)	136,428 ($α$ = 3, $β$ = 0.5, $ζ$ = 0.5)

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
