3.1. Shape Parameters Analysis
From the definitions of the aforementioned kernels, it is evident that each of them depends on some parameters, and it is not clear, a priori, which values should be assigned to them. The cross-validation (CV) phase tries to answer this question: the user defines a set of candidate values for the parameters and, for every admissible choice, solves the classification task and measures the goodness of the obtained results. It is well known in the RBF interpolation literature [25] that the shape parameters have to be chosen through a process similar to the CV phase. At the beginning, the user chooses some values for each parameter and then, varying them, checks how the condition number and the interpolation error change with the parameter values. The trade-off principle suggests considering values for which the condition number is not huge (stability) and the interpolation error is sufficiently small (accuracy). In the context of classification, we have to replace the condition number of the interpolation matrix and the interpolation error with analogous quantities: the condition number of the Gram matrix and the accuracy of the classifier. To obtain a good classifier, it is desirable to have a small condition number of the Gram matrix and an accuracy as close to 1 as possible. Our aim here was to run such an analysis for the kernels presented in this paper.
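To make the CV phase concrete, the following minimal Python sketch (our own illustration, not code from the paper) records, for each candidate parameter value, both the condition number of the training Gram matrix and the cross-validated accuracy of an SVM that consumes the kernel as a precomputed Gram matrix. The helper name `gram_for` and the 5-fold split are illustrative assumptions.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

def cv_phase(gram_for, param_grid, X_train, y_train):
    # For each candidate value, build the Gram matrix K[i, j] = k(x_i, x_j),
    # measure its condition number (stability) and the cross-validated
    # accuracy of an SVM fed with K directly (goodness of the classifier).
    results = []
    for value in param_grid:
        K = gram_for(X_train, value)
        cond = np.linalg.cond(K)
        acc = cross_val_score(SVC(kernel="precomputed"), K, y_train, cv=5).mean()
        results.append((value, cond, acc))
    return results
```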
The PSSK has only one parameter to tune: its scale parameter. Typically, users consider a small grid of candidate values. We ran the CV phase for different shuffles of a dataset and plotted the results in terms of the condition number of the Gram matrix of the training samples and the accuracy. For our analysis, we considered a fixed set of candidate values and ran tests on the datasets cited in the following. The results were similar in each case, so we decided to report those for the SHREC14 dataset.
From Figure 4, it is evident that large values of the scale parameter result in an unstable Gram matrix and lower accuracy. In what follows, therefore, we take into account only small values of the scale parameter.
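For reference, here is a sketch of the PSSK itself, using the standard closed form from the persistence scale-space literature; we denote the scale parameter by `sigma`, and the default value is purely illustrative.

```python
import numpy as np

def pssk(F, G, sigma=0.5):
    # Persistence Scale-Space Kernel between two diagrams, given as (n, 2)
    # arrays of (birth, death) points. Each point of G is also mirrored
    # across the diagonal, as in the closed-form expression of the kernel.
    G_mirror = G[:, ::-1]
    d2 = ((F[:, None, :] - G[None, :, :]) ** 2).sum(-1)
    d2_mirror = ((F[:, None, :] - G_mirror[None, :, :]) ** 2).sum(-1)
    return (np.exp(-d2 / (8 * sigma))
            - np.exp(-d2_mirror / (8 * sigma))).sum() / (8 * np.pi * sigma)
```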
The PWGK is the kernel with the largest number of parameters to tune; therefore, it was not evident which values should be taken into account. We chose reasonable starting sets for each parameter. Due to the large number of parameters, we first ran some experiments varying one subset of them with the others fixed, and then we reversed the roles.
We report in Figure 5 only a plot obtained with two of the parameters fixed, because it highlights how large values of the remaining parameter had to be excluded. We found this behavior for different values of the fixed parameters and for various datasets; here, we show the case of the MUTAG dataset. Therefore, we decided to vary the parameters over correspondingly reduced sets. Unfortunately, there was no other evidence that could guide the choices, except for one parameter, whose large values always yielded poor accuracy, as one can see below in the case of MUTAG with the shortest path distance.
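As a reference for the role of these parameters, here is a sketch of the linear form of the PWGK as commonly defined in the literature, where each diagram point is weighted by an arctangent of its persistence; the symbols `sigma`, `C` and `p` and their defaults are notational assumptions, not values from our experiments, and the full kernel may additionally wrap this quantity in an outer Gaussian.

```python
import numpy as np

def pwgk(D1, D2, sigma=1.0, C=1.0, p=1.0):
    # Linear Persistence Weighted Gaussian Kernel: pairs of points interact
    # through a Gaussian with bandwidth sigma, and each point is weighted by
    # w(x) = arctan(C * persistence(x) ** p).
    def w(D):
        return np.arctan(C * (D[:, 1] - D[:, 0]) ** p)  # persistence = death - birth
    d2 = ((D1[:, None, :] - D2[None, :, :]) ** 2).sum(-1)
    return (w(D1)[:, None] * w(D2)[None, :] * np.exp(-d2 / (2 * sigma ** 2))).sum()
```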
In the case of the SWK, there is only one parameter, the bandwidth. In [11], the authors proposed to consider the first decile, the last decile, and the median of the Gram matrix of the training samples, flattened into a vector; then, they multiplied these three values by a set of scaling factors.
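A sketch of this heuristic, assuming the pairwise distance matrix of the training samples is available; the scaling factors below are illustrative placeholders, not the values used in [11].

```python
import numpy as np

def candidate_bandwidths(dist_matrix, factors=(0.01, 0.1, 1.0, 10.0, 100.0)):
    # Flatten the off-diagonal entries, take the first decile, the median and
    # the last decile, and scale each of these three values by a few factors.
    flat = dist_matrix[np.triu_indices_from(dist_matrix, k=1)]
    quantiles = np.percentile(flat, [10, 50, 90])
    return np.unique(np.outer(quantiles, factors).ravel())
```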
For our analysis, we decided to study the behavior of this kernel by considering the same set of candidate values independently of the specific dataset.
We ran tests on some datasets, and the plot related to the DHFR dataset revealed clearly that large values of the bandwidth were to be excluded, as suggested by Figure 6. So, we decided to restrict the bandwidth to small values only.
The PFK has two parameters: the variance and t. In [12], the authors exhibited the procedure to follow in order to obtain the corresponding set of values; it shows that the choice of t depends on the variance. Our aim in this paper was to carry out an analysis that was dataset-independent and, therefore, strictly connected only to the definition of the kernel itself. First, we took different values for the variance and plotted the corresponding accuracies; here, we show the case of MUTAG with the shortest path distance, but the same behavior holds true for other datasets as well.
The condition numbers were high for every choice of parameters; therefore, we do not report them here, as the comparison would have been meaningless. From Figure 7, it is evident that it is convenient to set the variance lower than or equal to 10, while t should be set larger than or equal to 0.1. Thus, in what follows, we took into account only such values.
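A simplified sketch of the PFK, which smooths each diagram into a probability vector and plugs the resulting Fisher information distance into an exponential kernel; for brevity, this version evaluates the densities only on the union of the two diagrams' points and omits the projections onto the diagonal used in [12].

```python
import numpy as np

def _smoothed_density(diagram, grid, sigma):
    # Normalized mixture of isotropic Gaussians centered at the diagram
    # points, evaluated at the grid locations.
    d2 = ((grid[:, None, :] - diagram[None, :, :]) ** 2).sum(-1)
    rho = np.exp(-d2 / (2 * sigma ** 2)).sum(1)
    return rho / rho.sum()

def pfk(D1, D2, sigma=1.0, t=0.1):
    # Fisher information distance between the two smoothed measures,
    # followed by the exponential kernel exp(-t * d_FIM).
    grid = np.vstack([D1, D2])
    r1 = _smoothed_density(D1, grid, sigma)
    r2 = _smoothed_density(D2, grid, sigma)
    d_fim = np.arccos(np.clip(np.sqrt(r1 * r2).sum(), 0.0, 1.0))
    return np.exp(-t * d_fim)
```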
In the case of the PI, we considered a reasonable set of values for its parameter. The results, related to BZR with the shortest path distance, are shown in Figure 8. As with the previous kernels, the accuracy appeared to be better for small values of the parameter; for this reason, we restricted it to small values.
3.5. Graphs
In many different contexts, from medicine to chemistry, data can have the structure of graphs. A graph is a pair $G = (V, E)$, where $V$ is the set of vertices and $E$ is the set of edges. Graph classification is the task of attaching a label/class to each whole graph. In order to compute the persistent features, we needed to build a filtration. In the context of graphs, as in other cases, there are different possible definitions; see, for example, [38].
We considered the Vietoris–Rips filtration: starting from the set of vertices, at each step we add the edges whose weights are less than or equal to the current threshold value. This turned out to be the most common choice, and the software available online allows one to build it from the corresponding adjacency matrix. In our experiments, we considered only undirected graphs but, as in [38], building a filtration is also possible for directed graphs. Once the kind of filtration is defined, one again needs to choose the corresponding edge weights. We decided to take into account first the shortest path distance and then the Jaccard index, as, for example, in [14].
Given two vertices $u$ and $v$, the shortest path distance is defined as the minimum number of edges that one has to traverse going from $u$ to $v$ (or vice versa, since the graphs here are undirected). In graph theory, this is a widely used metric.
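Computing this edge weight from an adjacency matrix is straightforward with SciPy; a minimal sketch:

```python
from scipy.sparse.csgraph import shortest_path

def shortest_path_weights(adjacency):
    # All-pairs shortest path distances, counted in number of edges, for an
    # undirected, unweighted graph given by its adjacency matrix.
    return shortest_path(adjacency, directed=False, unweighted=True)
```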
The Jaccard index is a good measure of edge similarity. Given an edge $(u, v)$, the corresponding Jaccard index is computed as
$$J(u, v) = \frac{|N(u) \cap N(v)|}{|N(u) \cup N(v)|},$$
where $N(u)$ is the set of neighbors of $u$ in the graph. This metric recovers the local information of the nodes, in the sense that two nodes are considered similar if their neighbor sets are similar.
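A vectorized sketch of this computation from a binary adjacency matrix; it exploits the fact that, for a symmetric 0/1 matrix, the square of the matrix counts common neighbors.

```python
import numpy as np

def jaccard_weights(adjacency):
    # Jaccard index for every edge (u, v): |N(u) ∩ N(v)| / |N(u) ∪ N(v)|.
    A = (np.asarray(adjacency) > 0).astype(int)
    common = A @ A                                 # common[u, v] = |N(u) ∩ N(v)|
    deg = A.sum(axis=1)
    union = deg[:, None] + deg[None, :] - common   # |N(u) ∪ N(v)|
    with np.errstate(divide="ignore", invalid="ignore"):
        J = np.where(union > 0, common / union, 0.0)
    return np.where(A > 0, J, 0.0)                 # keep values only on edges
```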
In both cases, we considered the sub-level set filtration and we collected both zero- and one-dimensional persistent features.
We took six such datasets from the graph benchmark collections, all undirected, as follows:
- MUTAG: a collection of nitroaromatic compounds, the goal being to predict their mutagenicity on Salmonella typhimurium; 
- PTC: a collection of chemical compounds represented as graphs, labeled by their carcinogenicity in rats; 
- BZR: a collection of chemical compounds that one has to classify as active or inactive; 
- ENZYMES: a dataset of protein tertiary structures obtained from the BRENDA enzyme database; the aim is to classify each graph into one of six enzyme classes; 
- DHFR: a collection of chemical compounds that one has to classify as active or inactive; 
- PROTEINS: in each graph, nodes represent the secondary structure elements; the task is to predict whether or not a protein is an enzyme. 
The properties of the above datasets are summarized in Table 3, where IR is the so-called Imbalance Ratio, which denotes the imbalance of the dataset and is defined as the sample size of the majority class divided by the sample size of the minority class.
The adjacency matrices and the PDs were computed using the functions implemented in giotto-tda.
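A minimal sketch of this step, assuming the edge-weight matrices (e.g., shortest path distances) have already been stacked into the hypothetical array `distance_matrices` of shape (n_graphs, n, n):

```python
import numpy as np
from gtda.homology import VietorisRipsPersistence

# Treat each graph's edge-weight matrix as a precomputed metric and extract
# zero- and one-dimensional persistence diagrams for all graphs at once.
vr = VietorisRipsPersistence(metric="precomputed", homology_dimensions=(0, 1))
diagrams = vr.fit_transform(distance_matrices)
```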
The performances achieved with the two edge weights are reported in Table 4 and Table 5.
Thanks to these results, two conclusions can be reached. The first is that, as expected, the goodness of the classifier is strictly related to the particular filtration used for the computation of the persistent features. The second is that the SWK and the PFK seem to work slightly better than the other kernels: in the case of the shortest path distance (Table 4), the SWK is to be preferred, while the PFK seems to work better in the case of the Jaccard index (Table 5). In the case of PROTEINS, the PWGK provides the best Balanced Accuracy with both edge weights.