3.1. Shape Parameters Analysis
From the definitions of the aforementioned kernels, it is evident that each of them depends on some parameters, and it is not clear, a priori, which values should be assigned to them. The cross-validation (CV) phase tries to answer this question: the user defines a set of candidate values for the parameters and, for every admissible choice, solves the classification task and measures the goodness of the obtained results. It is well known in the RBF interpolation literature [25] that the shape parameters have to be chosen through a process similar to the CV phase. At the beginning, the user chooses some values for each parameter and then, varying them, checks how the condition number and the interpolation error change with the parameter values. The trade-off principle suggests considering values for which the condition number is not huge (stability) and the interpolation error is sufficiently small (accuracy). In the context of classification, we have to replace the condition number of the interpolation matrix and the interpolation error with analogous quantities: the condition number of the Gram matrix and the accuracy of the classifier. To obtain a good classifier, it is desirable to have a small condition number of the Gram matrix and an accuracy as close to 1 as possible. Our aim here was to run such an analysis for the kernels presented in this paper.
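To make the CV phase concrete, the following minimal Python sketch (our own illustration, not code from the paper) records, for each candidate parameter value, both the condition number of the training Gram matrix and the cross-validated accuracy of an SVM that consumes the kernel as a precomputed Gram matrix. The helper name `gram_for` and the 5-fold split are illustrative assumptions.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

def cv_phase(gram_for, param_grid, X_train, y_train):
    # For each candidate value, build the Gram matrix K[i, j] = k(x_i, x_j),
    # measure its condition number (stability) and the cross-validated
    # accuracy of an SVM fed with K directly (goodness of the classifier).
    results = []
    for value in param_grid:
        K = gram_for(X_train, value)
        cond = np.linalg.cond(K)
        acc = cross_val_score(SVC(kernel="precomputed"), K, y_train, cv=5).mean()
        results.append((value, cond, acc))
    return results
```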
The PSSK has only one parameter to tune: its scale parameter. Typically, users consider a small grid of candidate values. We ran the CV phase for different shuffles of a dataset and plotted the results in terms of the condition number of the Gram matrix of the training samples and the accuracy. For our analysis, we considered a fixed set of candidate values and ran tests on the datasets cited in the following. The results were similar in each case, so we decided to report those for the SHREC14 dataset.
From Figure 4, it is evident that large values of the scale parameter result in an unstable Gram matrix and lower accuracy. In what follows, therefore, we take into account only small values of the scale parameter.
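For reference, here is a sketch of the PSSK itself, using the standard closed form from the persistence scale-space literature; we denote the scale parameter by `sigma`, and the default value is purely illustrative.

```python
import numpy as np

def pssk(F, G, sigma=0.5):
    # Persistence Scale-Space Kernel between two diagrams, given as (n, 2)
    # arrays of (birth, death) points. Each point of G is also mirrored
    # across the diagonal, as in the closed-form expression of the kernel.
    G_mirror = G[:, ::-1]
    d2 = ((F[:, None, :] - G[None, :, :]) ** 2).sum(-1)
    d2_mirror = ((F[:, None, :] - G_mirror[None, :, :]) ** 2).sum(-1)
    return (np.exp(-d2 / (8 * sigma))
            - np.exp(-d2_mirror / (8 * sigma))).sum() / (8 * np.pi * sigma)
```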
The PWGK is the kernel with the largest number of parameters to tune; therefore, it was not evident which values should be taken into account. We chose reasonable starting sets for each parameter. Due to the large number of parameters, we first ran some experiments varying one subset of them with the others fixed, and then we reversed the roles.
We report in Figure 5 only a plot obtained with two of the parameters fixed, because it highlights how large values of the remaining parameter had to be excluded. We found this behavior for different values of the fixed parameters and for various datasets; here, we show the case of the MUTAG dataset. Therefore, we decided to vary the parameters over correspondingly reduced sets. Unfortunately, there was no other evidence that could guide the choices, except for one parameter, whose large values always yielded poor accuracy, as one can see below in the case of MUTAG with the shortest path distance.
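As a reference for the role of these parameters, here is a sketch of the linear form of the PWGK as commonly defined in the literature, where each diagram point is weighted by an arctangent of its persistence; the symbols `sigma`, `C` and `p` and their defaults are notational assumptions, not values from our experiments, and the full kernel may additionally wrap this quantity in an outer Gaussian.

```python
import numpy as np

def pwgk(D1, D2, sigma=1.0, C=1.0, p=1.0):
    # Linear Persistence Weighted Gaussian Kernel: pairs of points interact
    # through a Gaussian with bandwidth sigma, and each point is weighted by
    # w(x) = arctan(C * persistence(x) ** p).
    def w(D):
        return np.arctan(C * (D[:, 1] - D[:, 0]) ** p)  # persistence = death - birth
    d2 = ((D1[:, None, :] - D2[None, :, :]) ** 2).sum(-1)
    return (w(D1)[:, None] * w(D2)[None, :] * np.exp(-d2 / (2 * sigma ** 2))).sum()
```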
In the case of the SWK, there is only one parameter, the bandwidth. In [11], the authors proposed to consider the first decile, the last decile, and the median of the Gram matrix of the training samples, flattened into a vector; then, they multiplied these three values by a set of scaling factors.
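A sketch of this heuristic, assuming the pairwise distance matrix of the training samples is available; the scaling factors below are illustrative placeholders, not the values used in [11].

```python
import numpy as np

def candidate_bandwidths(dist_matrix, factors=(0.01, 0.1, 1.0, 10.0, 100.0)):
    # Flatten the off-diagonal entries, take the first decile, the median and
    # the last decile, and scale each of these three values by a few factors.
    flat = dist_matrix[np.triu_indices_from(dist_matrix, k=1)]
    quantiles = np.percentile(flat, [10, 50, 90])
    return np.unique(np.outer(quantiles, factors).ravel())
```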
For our analysis, we decided to study the behavior of this kernel by considering the same set of candidate values independently of the specific dataset.
We ran tests on some datasets, and the plot related to the DHFR dataset revealed clearly that large values of the bandwidth were to be excluded, as suggested by Figure 6. So, we decided to restrict the bandwidth to small values only.
The PFK has two parameters: the variance and t. In [12], the authors exhibited the procedure to follow in order to obtain the corresponding set of values; it shows that the choice of t depends on the variance. Our aim in this paper was to carry out an analysis that was dataset-independent and, therefore, strictly connected only to the definition of the kernel itself. First, we took different values for the variance and plotted the corresponding accuracies; here, we show the case of MUTAG with the shortest path distance, but the same behavior holds true for other datasets as well.
The condition numbers were high for every choice of parameters; therefore, we do not report them here, as the comparison would have been meaningless. From Figure 7, it is evident that it is convenient to set the variance lower than or equal to 10, while t should be set larger than or equal to 0.1. Thus, in what follows, we took into account only such values.
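A simplified sketch of the PFK, which smooths each diagram into a probability vector and plugs the resulting Fisher information distance into an exponential kernel; for brevity, this version evaluates the densities only on the union of the two diagrams' points and omits the projections onto the diagonal used in [12].

```python
import numpy as np

def _smoothed_density(diagram, grid, sigma):
    # Normalized mixture of isotropic Gaussians centered at the diagram
    # points, evaluated at the grid locations.
    d2 = ((grid[:, None, :] - diagram[None, :, :]) ** 2).sum(-1)
    rho = np.exp(-d2 / (2 * sigma ** 2)).sum(1)
    return rho / rho.sum()

def pfk(D1, D2, sigma=1.0, t=0.1):
    # Fisher information distance between the two smoothed measures,
    # followed by the exponential kernel exp(-t * d_FIM).
    grid = np.vstack([D1, D2])
    r1 = _smoothed_density(D1, grid, sigma)
    r2 = _smoothed_density(D2, grid, sigma)
    d_fim = np.arccos(np.clip(np.sqrt(r1 * r2).sum(), 0.0, 1.0))
    return np.exp(-t * d_fim)
```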
In the case of the PI, we considered a reasonable set of values for its parameter. The results, related to BZR with the shortest path distance, are shown in Figure 8. As with the previous kernels, the accuracy appeared to be better for small values of the parameter; for this reason, we restricted it to small values.
3.5. Graphs
In many different contexts, from medicine to chemistry, data can have the structure of graphs. A graph is a pair $G = (V, E)$, where $V$ is the set of vertices and $E$ is the set of edges. Graph classification is the task of attaching a label/class to each whole graph. In order to compute the persistent features, we needed to build a filtration. In the context of graphs, as in other cases, there are different possible definitions; see, for example, [38].
We considered the Vietoris–Rips filtration: starting from the set of vertices, at each step we add the edges whose weights are less than or equal to the current threshold value. This turned out to be the most common choice, and the software available online allows one to build it from the corresponding adjacency matrix. In our experiments, we considered only undirected graphs but, as in [38], building a filtration is also possible for directed graphs. Once the kind of filtration is defined, one again needs to choose the corresponding edge weights. We decided to take into account first the shortest path distance and then the Jaccard index, as, for example, in [14].
Given two vertices $u$ and $v$, the shortest path distance is defined as the minimum number of edges that one has to traverse going from $u$ to $v$ (or vice versa, since the graphs here are undirected). In graph theory, this is a widely used metric.
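Computing this edge weight from an adjacency matrix is straightforward with SciPy; a minimal sketch:

```python
from scipy.sparse.csgraph import shortest_path

def shortest_path_weights(adjacency):
    # All-pairs shortest path distances, counted in number of edges, for an
    # undirected, unweighted graph given by its adjacency matrix.
    return shortest_path(adjacency, directed=False, unweighted=True)
```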
The Jaccard index is a good measure of edge similarity. Given an edge $(u, v)$, the corresponding Jaccard index is computed as
$$J(u, v) = \frac{|N(u) \cap N(v)|}{|N(u) \cup N(v)|},$$
where $N(u)$ is the set of neighbors of $u$ in the graph. This metric recovers the local information of the nodes, in the sense that two nodes are considered similar if their neighbor sets are similar.
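A vectorized sketch of this computation from a binary adjacency matrix; it exploits the fact that, for a symmetric 0/1 matrix, the square of the matrix counts common neighbors.

```python
import numpy as np

def jaccard_weights(adjacency):
    # Jaccard index for every edge (u, v): |N(u) ∩ N(v)| / |N(u) ∪ N(v)|.
    A = (np.asarray(adjacency) > 0).astype(int)
    common = A @ A                                 # common[u, v] = |N(u) ∩ N(v)|
    deg = A.sum(axis=1)
    union = deg[:, None] + deg[None, :] - common   # |N(u) ∪ N(v)|
    with np.errstate(divide="ignore", invalid="ignore"):
        J = np.where(union > 0, common / union, 0.0)
    return np.where(A > 0, J, 0.0)                 # keep values only on edges
```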
In both cases, we considered the sub-level set filtration and we collected both zero- and one-dimensional persistent features.
We took six such datasets from the graph benchmark collections, all undirected, as follows:
- MUTAG: a collection of nitroaromatic compounds, the goal being to predict their mutagenicity on Salmonella typhimurium; 
- PTC: a collection of chemical compounds represented as graphs, labeled by their carcinogenicity in rats; 
- BZR: a collection of chemical compounds that one has to classify as active or inactive; 
- ENZYMES: a dataset of protein tertiary structures obtained from the BRENDA enzyme database; the aim is to classify each graph into one of six enzyme classes; 
- DHFR: a collection of chemical compounds that one has to classify as active or inactive; 
- PROTEINS: in each graph, nodes represent the secondary structure elements; the task is to predict whether or not a protein is an enzyme. 
The properties of the above datasets are summarized in Table 3, where IR is the so-called Imbalance Ratio, which denotes the imbalance of the dataset and is defined as the sample size of the majority class divided by the sample size of the minority class.
The adjacency matrices and the PDs were computed using the functions implemented in giotto-tda.
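A minimal sketch of this step, assuming the edge-weight matrices (e.g., shortest path distances) have already been stacked into the hypothetical array `distance_matrices` of shape (n_graphs, n, n):

```python
import numpy as np
from gtda.homology import VietorisRipsPersistence

# Treat each graph's edge-weight matrix as a precomputed metric and extract
# zero- and one-dimensional persistence diagrams for all graphs at once.
vr = VietorisRipsPersistence(metric="precomputed", homology_dimensions=(0, 1))
diagrams = vr.fit_transform(distance_matrices)
```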
The performances achieved with the two edge weights are reported in Table 4 and Table 5.
Thanks to these results, two conclusions can be reached. The first is that, as expected, the goodness of the classifier is strictly related to the particular filtration used for the computation of the persistent features. The second is that the SWK and the PFK seem to work slightly better than the other kernels: in the case of the shortest path distance (Table 4), the SWK is to be preferred, while the PFK seems to work better in the case of the Jaccard index (Table 5). In the case of PROTEINS, the PWGK provides the best Balanced Accuracy with both edge weights.