#### 2.2.1. Embedding

Let $\mathcal{D}=\{\mathcal{G}_1,\dots,\mathcal{G}_{N_P}\}$ be a dataset of $N_P$ graphs, where each graph has the form $\mathcal{G}=(\mathcal{V},\mathcal{E},\mathcal{L}_v)$, with $\mathcal{L}_v$ being the set of vertex labels. For the sake of argument, let us consider a supervised problem; thus, let $\mathcal{L}$ be the corresponding ground-truth class labels for each of the $N_P$ graphs in $\mathcal{D}$. Further, consider $\mathcal{D}$ to be split into three non-overlapping training, validation and test sets ($\mathcal{D}_{\mathrm{TR}}$, $\mathcal{D}_{\mathrm{VAL}}$, $\mathcal{D}_{\mathrm{TS}}$, respectively) and, by extension, the labels $\mathcal{L}$ to be split accordingly ($\mathcal{L}_{\mathrm{TR}}$, $\mathcal{L}_{\mathrm{VAL}}$, $\mathcal{L}_{\mathrm{TS}}$). Let $q$ be the number of classes for the classification problem at hand.

The first step is to evaluate the simplicial complex separately for all graphs in the three dataset splits, hence

$$\mathcal{D}_{\mathrm{TR}}^{\mathrm{SC}} = \{sc(\mathcal{G}) \mid \mathcal{G}\in\mathcal{D}_{\mathrm{TR}}\},\qquad \mathcal{D}_{\mathrm{VAL}}^{\mathrm{SC}} = \{sc(\mathcal{G}) \mid \mathcal{G}\in\mathcal{D}_{\mathrm{VAL}}\},\qquad \mathcal{D}_{\mathrm{TS}}^{\mathrm{SC}} = \{sc(\mathcal{G}) \mid \mathcal{G}\in\mathcal{D}_{\mathrm{TS}}\},$$

where $sc:\mathcal{G}\to\mathcal{S}$ is a function that evaluates the simplicial complex starting from the 1-skeleton $\mathcal{G}$.

However, the embedding is performed on the concatenation of $\mathcal{D}_{\mathrm{TR}}$ and $\mathcal{D}_{\mathrm{VAL}}$ or, specifically, of $\mathcal{D}_{\mathrm{TR}}^{\mathrm{SC}}$ and $\mathcal{D}_{\mathrm{VAL}}^{\mathrm{SC}}$. In other words, the alphabet sees the concatenation of the simplices belonging to the simplicial complexes evaluated starting from all graphs in $\mathcal{D}_{\mathrm{TR}}$ and $\mathcal{D}_{\mathrm{VAL}}$.
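The paper does not spell out here how $sc(\cdot)$ builds the complex from the 1-skeleton; a common choice is the clique complex, where every $(k+1)$-clique of the graph becomes a $k$-simplex. The following is a minimal sketch under that assumption, using brute-force subset enumeration (fine only for small toy graphs; `sc`, the vertex/edge encoding and `max_order` are illustrative choices, not the authors' implementation):

```python
from itertools import combinations

def sc(vertices, edges, max_order=2):
    """Sketch of sc(G): evaluate a simplicial complex from a 1-skeleton.
    Assumption: the clique complex, where every (k+1)-clique of G yields
    a k-simplex. Brute force over vertex subsets (toy graphs only)."""
    adj = {v: set() for v in vertices}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    simplices = []
    for k in range(1, max_order + 2):  # simplex sizes 1 .. max_order+1
        for subset in combinations(sorted(vertices), k):
            # a vertex subset is a simplex iff it induces a clique in G
            if all(b in adj[a] for a, b in combinations(subset, 2)):
                simplices.append(subset)
    return simplices

# Toy 1-skeleton: a triangle {0, 1, 2} plus a pendant vertex 3
S = sc([0, 1, 2, 3], [(0, 1), (1, 2), (0, 2), (2, 3)])
```

On this toy graph the complex contains the four vertices, the four edges and the single triangle, i.e. nine simplices in total.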

In the case of large networks and/or large datasets, this might lead to a huge number of simplices which are hard to match. For example, let us consider any given node belonging to a given graph to be identified by a progressive unique number. In this case, it is impossible to match two simplices belonging to (possibly) two different simplicial complexes (i.e., to determine whether they are equal or not). In order to overcome this problem, node labels $\mathcal{L}_v$ play an important role. Indeed, a simplex can dually be described by the set of node labels belonging to its vertices. This conversion from 'simplices-of-nodes' to 'simplices-of-node-labels' has a three-fold meaning, especially if node labels belong to a categorical and finite set:

- the match between two simplices (possibly belonging to different simplicial complexes) can be done in an exact manner: two simplices are equal if they have the same order and they share the same set of node labels;
- simplicial complexes become multi-sets: two simplices (also within the same simplicial complex) can have the same order and share the same set of node labels;
- the enumeration of different (unique) simplices is straightforward.
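The three points above can be sketched as follows: sorting the vertex labels of each simplex yields a canonical representation, so exact matching, multi-set behaviour and unique-simplex enumeration all come for free. The labels and simplices below are purely illustrative:

```python
from collections import Counter

def to_label_simplex(simplex, node_labels):
    """Convert a 'simplex-of-nodes' into a 'simplex-of-node-labels'.
    Sorting gives a canonical form, so two simplices match exactly iff
    they have the same order and the same multiset of node labels."""
    return tuple(sorted(node_labels[v] for v in simplex))

# Hypothetical categorical labels (e.g. atom types in a molecular graph)
node_labels = {0: "C", 1: "C", 2: "O", 3: "C"}
complex_of_nodes = [(0,), (1,), (2,), (3,), (0, 1), (1, 2), (0, 3)]
label_simplices = [to_label_simplex(s, node_labels) for s in complex_of_nodes]

# The complex becomes a multi-set: (0, 1) and (0, 3) both map to ("C", "C")
multiset = Counter(label_simplices)
```

Here `multiset` counts two occurrences of `("C", "C")` and its keys enumerate the four unique simplices directly.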

In light of these observations, it is possible to define the three counterparts of Equations (1)–(3) in which each given node $u$ belonging to a given simplex $\sigma$ is represented by its node label. Let $\mathcal{A}$ be the set of unique (distinct) simplices belonging to the simplicial complexes evaluated from graphs in $\mathcal{D}_{\mathrm{TR}}\cup\mathcal{D}_{\mathrm{VAL}}$, and let $|\mathcal{A}|=M$. The next step is to properly build the embedding vectors thanks to the symbolic histograms paradigm. Accordingly, each simplicial complex $\mathcal{S}$ (evaluated on top of a given graph, that is, a 1-skeleton) is mapped into an $M$-length integer-valued vector $\mathbf{h}$ as follows:

$$\mathbf{h} = \left[\mathrm{count}(a_1,\mathcal{S}),\dots,\mathrm{count}(a_M,\mathcal{S})\right],$$

where $\mathrm{count}(a,b)$ is a function that counts the number of times $a$ appears in $b$.

The three sets $\mathcal{D}_{\mathrm{TR}}$, $\mathcal{D}_{\mathrm{VAL}}$ and $\mathcal{D}_{\mathrm{TS}}$ are separately cast into three proper instance matrices $\mathbf{D}_{\mathrm{TR}}\in\mathbb{R}^{|\mathcal{D}_{\mathrm{TR}}|\times M}$, $\mathbf{D}_{\mathrm{VAL}}\in\mathbb{R}^{|\mathcal{D}_{\mathrm{VAL}}|\times M}$ and $\mathbf{D}_{\mathrm{TS}}\in\mathbb{R}^{|\mathcal{D}_{\mathrm{TS}}|\times M}$. For each set, the corresponding instance matrix stores in position $(i,j)$ the number of occurrences of the $j$th symbol (simplex) from $\mathcal{A}$ within the $i$th simplicial complex (in turn, evaluated on top of the $i$th graph).
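The symbolic-histogram embedding can be sketched directly from these definitions; the alphabet and the two toy complexes below are hypothetical, and stacking the histograms row-wise yields a small instance matrix:

```python
from collections import Counter

def symbolic_histogram(label_complex, alphabet):
    """Map a simplicial complex (list of label-simplices) into an M-length
    integer vector h: entry j counts how many times alphabet[j] occurs."""
    counts = Counter(label_complex)
    return [counts[a] for a in alphabet]

# Hypothetical alphabet A with M = 4 symbols
alphabet = [("C",), ("O",), ("C", "C"), ("C", "O")]

# Two simplicial complexes; stacking their histograms row-wise gives an
# instance matrix of shape (number of graphs) x M
S1 = [("C",), ("C",), ("O",), ("C", "C"), ("C", "C")]
S2 = [("C",), ("O",), ("C", "O"), ("N",)]  # ("N",) is outside A: counted as 0
D = [symbolic_histogram(S, alphabet) for S in (S1, S2)]
```

For instance, `S1` maps to the row `[2, 1, 2, 0]`.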

#### 2.2.2. Classification

In the embedding space, namely the vector space spanned by the symbolic histograms of the form of Equation (8), any classification system can be used. However, it is worth stressing the importance of feature selection whilst performing classification, as per the following two (not mutually exclusive) rationales:

For a given classification system $C$, let us consider its set of hyper-parameters $\mathcal{H}$ to be tuned. Further, let $\mathbf{w}\in\{0,1\}^M$ be an $M$-length binary vector in charge of selecting the features (columns) from the instance matrices (i.e., the symbols from $\mathcal{A}$) corresponding to non-zero elements. The tuple $[\mathcal{H},\mathbf{w}]$ can be optimised, for example, by means of a genetic algorithm [71] or other metaheuristics.

In this work, two different classification systems are investigated. The former relies on non-linear $\nu$-Support Vector Machines ($\nu$-SVMs) [72], whereas the latter relies on 1-norm Support Vector Machines ($\ell_1$-SVMs) [73]. The rationale behind using the latter is as follows: $\ell_1$-SVMs, by minimising the 1-norm instead of the 2-norm of the separating hyperplane as in standard SVMs [72,74,75], return a solution (hyperplane coefficient vector) which is sparse; this allows feature selection to be performed during training.

For the sake of sketching a general framework, let us start our discussion from $\nu$-SVMs, which do not natively return a sparse solution (i.e., do not natively perform any feature selection). The $\nu$-SVM is equipped with the radial basis function kernel:

$$K(\mathbf{x},\mathbf{y}) = \exp\left(-\gamma\cdot D(\mathbf{x},\mathbf{y})^2\right),$$

where $\mathbf{x},\mathbf{y}$ are two given patterns from the dataset at hand, $D(\cdot,\cdot)$ is a suitable (dis)similarity measure and $\gamma$ is the kernel shape parameter. The adopted dissimilarity measure is the weighted Euclidean distance:

$$D(\mathbf{x},\mathbf{y}) = \sqrt{\sum_{i=1}^{M}\mathbf{w}_i\cdot(\mathbf{x}_i-\mathbf{y}_i)^2},$$

where $M$ is the number of features and $\mathbf{w}_i\in\{0,1\}$ is the binary weight for the $i$th feature. Hence, it is possible to define $\mathcal{H}=[\nu,\gamma]$, and the overall genetic code for the $\nu$-SVM has the form $[\nu,\gamma,\mathbf{w}]$.
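The weighted distance and the kernel built on top of it can be sketched in a few lines; the specific patterns and $\gamma$ value below are illustrative, and the kernel form $K(\mathbf{x},\mathbf{y})=\exp(-\gamma\,D(\mathbf{x},\mathbf{y})^2)$ is the usual RBF construction assumed here:

```python
import math

def weighted_euclidean(x, y, w):
    """Weighted Euclidean distance with binary weights w_i in {0, 1}:
    features whose weight is 0 are excluded from the comparison."""
    return math.sqrt(sum(wi * (xi - yi) ** 2 for xi, yi, wi in zip(x, y, w)))

def rbf_kernel(x, y, w, gamma):
    """RBF kernel built on top of the weighted (dis)similarity measure."""
    return math.exp(-gamma * weighted_euclidean(x, y, w) ** 2)

x, y = [2.0, 1.0, 5.0], [0.0, 4.0, 3.0]
w = [1, 0, 1]  # the second feature is switched off by the selection vector
```

With the second feature masked out, only the first and third coordinates contribute to the distance.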

Each individual from the evolving population exploits $\mathbf{D}_{\mathrm{TR}}$ to train a $\nu$-SVM using the parameters written in its genetic code.

The optimal hyperparameter set is the one that minimises the following objective function on $\mathbf{D}_{\mathrm{VAL}}$:

$$F = \alpha\cdot(1-J) + (1-\alpha)\cdot\frac{\|\mathbf{w}\|_1}{M},$$

where $J$ is the (normalised) informedness [76,77]. (Originally, the informedness is defined as $J=(\mathrm{Sensitivity}+\mathrm{Specificity}-1)$ and is therefore bounded in $[-1,+1]$; however, since the rightmost term in Equation (13) is bounded in $[0,1]$ and $\alpha\in[0,1]$, we adopt a normalised version in order to ensure a fair combination. Moreover, the informedness, by definition, addresses binary problems; in the case of multiclass problems, one can evaluate the informedness for each class by marking it as positive and then consider the average value amongst the problem-related classes.) The rightmost term takes into account the sparsity of the feature selection vector $\mathbf{w}$. Finally, $\alpha\in[0,1]$ is a user-defined parameter which weights the contribution of performance (leftmost term) and the number of selected alphabet symbols (rightmost term). As the evolution ends, the best individual is evaluated on $\mathbf{D}_{\mathrm{TS}}$.
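The two ingredients of the fitness can be sketched as follows. Note that the exact form of Equation (13) is not reproduced in this excerpt; the convex combination below is one plausible reading consistent with the description (both terms in $[0,1]$, minimised, weighted by $\alpha$), and the rescaling of $J$ from $[-1,+1]$ to $[0,1]$ is likewise an assumption:

```python
def normalised_informedness(sensitivity, specificity):
    """J = Sensitivity + Specificity - 1 lies in [-1, +1]; rescaling it
    to [0, 1] makes it commensurable with the sparsity term (assumed)."""
    return ((sensitivity + specificity - 1.0) + 1.0) / 2.0

def fitness(J_norm, w, alpha):
    """Plausible sketch of Eq. (13), to be minimised: a convex combination
    (weighted by alpha) of the performance term (1 - J) and the fraction
    of selected alphabet symbols sum(w) / M."""
    return alpha * (1.0 - J_norm) + (1.0 - alpha) * (sum(w) / len(w))
```

A perfect classifier that selects no symbols would score a fitness of 0, the best attainable value under this form.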

As previously introduced, $\ell_1$-SVMs minimise the 1-norm of the separating hyperplane and natively return a sparse hyperplane coefficient vector, say $\boldsymbol{\beta}$. In this case, the genetic code does not consider $\mathbf{w}$, and only $\mathcal{H}$ can be optimised. For $\ell_1$-SVMs, the genetic code has the form $[C,\mathbf{c}]$, where $C$ is the regularisation parameter and $\mathbf{c}\in\mathbb{R}^q$ are additional weights in order to adjust $C$ in a class-wise fashion ($\mathbf{c}$ is not mandatory for $\ell_1$-SVMs to work, but it might be of help in case of heavily-unbalanced classes). Specifically, for the $i$th class, the misclassification penalty is given by $C\cdot\mathbf{c}_i$. The evolutionary optimisation does not significantly change with respect to the $\nu$-SVM case: each individual trains an $\ell_1$-SVM on $\mathbf{D}_{\mathrm{TR}}$ using the hyperparameters written in its genetic code, and its results are validated on $\mathbf{D}_{\mathrm{VAL}}$. The fitness function is still given by Equation (13), with $\boldsymbol{\beta}$ in lieu of $\mathbf{w}$. As the evolution ends, the best individual is evaluated on $\mathbf{D}_{\mathrm{TS}}$.
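The class-wise penalty adjustment is a one-liner; the number of classes and the weight values below are hypothetical (a larger weight on a minority class makes its misclassifications more expensive):

```python
def classwise_penalties(C, c):
    """Per-class misclassification penalties for the l1-SVM: the i-th
    class is penalised by C * c[i]; c is optional but can help with
    heavily-unbalanced classes."""
    return [C * ci for ci in c]

# q = 3 classes; the (hypothetical) minority class gets a larger weight
penalties = classwise_penalties(10.0, [1.0, 1.0, 2.5])
```

With uniform weights $\mathbf{c} = [1,\dots,1]$, this degenerates to the plain single-$C$ formulation.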