Designing labeled graph classifiers by exploiting the R\'enyi entropy of the dissimilarity representation

Representing patterns by complex relational structures, such as labeled graphs, is becoming an increasingly common practice in the broad field of computational intelligence. Accordingly, a wide repertoire of pattern recognition tools, such as classifiers and knowledge discovery procedures, are nowadays available and tested for various labeled graph data types. However, the design of effective learning and mining procedures operating in the space of labeled graphs is still a challenging problem, especially from the computational complexity viewpoint. In this paper, we present a major improvement of a general-purpose graph classification system, which is conceived on an interplay among dissimilarity representation, clustering, information-theoretic techniques, and evolutionary optimization. The improvement focuses on a specific key subroutine of the system that performs the compression of the input data. We prove different theorems which are fundamental to the setting of such a compression operation. We demonstrate the effectiveness of the resulting classifier by benchmarking the developed variants on well-known datasets of labeled graphs, considering as distinct performance indicators the classification accuracy, the computing time, and the parsimony in terms of structural complexity of the synthesized classification model. Overall, the results show state-of-the-art standards in terms of test set accuracy, while achieving considerable reductions for what concerns the effective computing time and classification model complexity.


Introduction
A labeled graph offers a powerful model for representing patterns characterized by interacting elements, in either a static or a dynamic scenario. Applications of labeled graphs as data representation models can be cited in almost every scientific context, such as electrical circuits [21,54], networks of dynamical systems [60,77], biochemical interaction networks [20,31,75], general time-varying labeled graphs [5], and segmented images [18,57,71]. Owing to the rapid diffusion of (cheap) multicore computing hardware, and motivated by the increasing availability of interesting datasets describing complex interaction-oriented patterns, recent researches on graph-based pattern recognition systems have produced numerous distinct solutions [1,4,6,10,14,25,29,39,40,43,46,48,49,53,57]. Focusing on the high-level design of classification systems for graphs, it is possible to identify two main different approaches: those that operate directly in the domain of labeled graphs and those that deal with the classification problem in a suitable embedding space. Of notable interest are those systems that are based on the so-called explicit graph embedding algorithms, which transform the input graphs into numeric vectors by means of a mapping or feature extraction technique [43]. Such a transformation, on one hand simplifies the problem of learning and testing a classification model by enabling the possibility of adopting well-established pattern recognition tools; and on the other hand it may also cause information loss due to the intrinsic complexity of the labeled graph data type, which synthesizes both topological and semantic information (i.e., the vertex/edge attributes) of data in the same structure. Nevertheless, designing graph embedding based classification systems by means of a core Inexact Graph Matching (IGM) procedure operating directly in the labeled graphs domain proved to be an effective solution [6,9,45,48,66]. The dissimilarity representation offers a valuable framework for this purpose, since it permits to describe arbitrarily complex objects by means of their pairwise dissimilarity values (DV) [58].
Recently, the Optimized Dissimilarity Space Embedding (ODSE) system has been proposed as a general labeled graph classifier, showing state-of-the-art results in terms of classification accuracy on wellknown benchmarking datasets [48]. The synthesis of the ODSE classification model is performed by a novel information-theoretic interpretation of the dissimilarity matrix (DM) in terms of conveyed information. In practice, the system estimates the informativeness of the input data dissimilarity representation by calculating the quadratic Rényi entropy (QRE) [61]. Such an entropic characterization of the DVs underlying distribution proved to be effective, and it has been used in the compression-expansion scheme as well as an important factor of the ODSE objective function, core elements of the ODSE model synthesis. Deriving the ODSE classification model is however computationally demanding. As a consequence, we have developed two improved versions of the ODSE graph classification system [45], which are founded on a fast clustering-based compression (CBC) scheme. The setting of such a clustering algorithm is determined analytically, by relying on a formal proof. This fact caused a considerable computational speed-up of the model synthesis stage, maintaining state-of-the-art test set classification performances.
In this paper, we elaborate further over the same CBC scheme estimating the differential α-order Rényi entropy of the DVs by means of a faster technique that relies on the so-called entropic Minimum Spanning Tree (MST) [7]. Also in this case, we give a formal proof pertaining the setting of the clustering algorithm governing the compression. We experimentally demonstrate that the performance of ODSE operating with the MST-based estimator is comparable to the kernel-based estimator version, although, in general, with the former the overall computing time is lower. Such results strengthen the effectiveness of the methodological approach underlying ODSE.
The remainder of the paper is organized as follows. Section 2 gives the necessary background and collocates the paper in the appropriate scientific context. In Section 3 we give an overview of the original ODSE graph classification system design [48]. Throughout Section 4 we present the improved ODSE system, which is primarily discussed considering the QRE estimator. In Section 4.3, we discuss a relevant topic related to the (worst-case) efficiency of the developed CBC procedure. Section 5 introduces the principal theoretical contribution of this paper, which consists in the use of the MST-based estimator as a component of the ODSE system; the theorem related to the CBC setting is shown here. Experiments and comparisons with other graph classifiers on well-known benchmarking datasets are presented in Section 6. Results are discussed by considering many different ODSE variants. In particular, we consider two classification systems operating in the dissimilarity space (DS): a k -NN rule based classifier and a neurofuzzy min-max network (MMN) trained with the ARC algorithm [67]. Conclusions and future directions follow in Section 7.

Entropy-based Data Characterization
Designing pattern recognition (sub)systems by using concepts derived from information theory is nowadays a consolidated practice [8,15,25,32,41,74]. A key issue in this context is the estimation of informationtheoretic quantities from a given dataset, such as entropy and mutual information. The entropy of a dis-tribution describes and quantifies the uncertainty of the modeled system/process in terms of randomness. Such a quantification can be widely used in data analysis for characterizing the observed system/process at a mesoscopic level. From the groundbreaking work of Shannon, different generalized formulations have been proposed. Here we are interested in the generalization proposed by Rényi, which is called α-order Rényi entropy. Given a continuous random variable X, distributed according to a probability density function p(·), the α-order Rényi entropy measure is defined as: In the following two subsections, we explain in detail the non-parametric α-order Rényi entropy estimation techniques that we use as components of the ODSE graph classifier. In Sec. 2.1.1 we introduce the kernelbased entropy estimation technique proposed by Príncipe [61], while in Sec. 2.1.2 we show another entropy estimation technique that is based on the calculation of the entropic MST [7].

The QRE Estimator
Recently, Príncipe [61] provided a formulation of Eq. 1 in terms of the so-called information potential of order α, V α (X), ( When α = 2 holds in Eq. 2, the entropy measure simplifies to the so-called quadratic Rényi entropy (QRE). Non-parametric kernel-based estimators provide a "plug-in" solution to the problem of estimating information-theoretic data descriptors, such as the entropy. A well-known density estimation technique is the Parzen-Rosenblatt windowing, also called kernel density estimation. Usually, a zero-mean Gaussian kernel function G σ (·) is adopted, giving rise to a probability density estimator of the form: The Gaussian kernel G σ (·) enables a controllable bias-variance trade-off of the estimator dependent on the kernel size σ (and on the data sample size n). According to Príncipe [61], the QRE of the joint distribution of a d -dimensional random vector can be estimated by relying on d different unidimensional kernel estimators combined as follows: where V 2,σ (·) is the quadratic information potential (QIP). QIP is computed assuming the d random variables to be independent and by using the convoluted Gaussian kernel G σ √ 2 (·) with doubled variance, evaluated at the difference among the input measurements (realizations). As any probability-based uncertainty measure, the QRE assumes its maximum value when the distribution is uniform, where ∆ is a (finite) measure of the input data extent [61]. Notably, Eq. 5 can be used to normalize the (estimated) QRE within the [0, 1] range. O(dn 2 ) kernel evaluations are needed to compute (4), which may become onerous due to the cost of computing the exponential function.

The MST-based Entropy Estimator
Let X n be the data sample of n measurements (points), with x i ∈ R d , i = 1, 2, ..., n, and d ≥ 2, and let G(X n ) be the complete (entropic) graph constructed over these n measurements. An edge e ij of such a graph connects x i and x j in R d by means of a straight line described by the length |e ij |, which is computed taking the Euclidean distance: The α-order Rényi entropy (1) can be estimated according to a geometric interpretation of a MST of G(X n ) in R d (MST-RE). To this end, let L γ (X n ) be the weighted length of a MST T connecting the n points, which is defined as where γ ∈ (0, d) is a user-defined parameter, and T (G(X n )) is the set of all possible (entropic) spanning trees of G(X n ). The Rényi entropy of order α ∈ (0, 1), elaborated using the MST length (7), is defined as follows [7,34]:Ĥ where the order α is determined by calculating: The β(L γ , d) term is a constant (given the data dimensionality) that can be approximated, for large enough dimensions, d, as: Modifying γ we obtain different α-order Rényi entropies. This fact suggests that the selection of the γ parameter could be performed according to a suitable performance criterion defined for the task at hand. By definition of G(X n ), MST-RE (8) is not sensible to the dimensionality of the input measurements.
Assuming to perform the estimation on a set of n measurements in R d , the computational complexity involved in computing Eq. 8 is given by: The first term in (11) accounts for the generation of G(X n ), computing the respective Euclidean distances for the edge weights. The second term quantifies the cost involved in the MST computation using the well-known Kruskal's algorithm; this cost can be reduced by adopting faster approximations for the MST computation [7], which however will not be considered in this paper. The last term in (11) concerns the computation of the MST length.

The Dissimilarity Representation for Pattern Recognition
In the dissimilarity representation [58], the elements of an input dataset S ⊂ X are characterized by considering their pairwise DVs. The key component is the definition of a nonnegative (bounded) dissimilarity measure d : X × X → R + , which is in charge of synthesizing all relevant commonalities among the input patterns of S into a single real-valued number. A set of prototypes, R, called representation set (RS), is used to develop the DM, D, which is given as D ij = d(x i , r j ), for every x i ∈ S and r j ∈ R.
Let D be the vector space on which the input data of S is represented through the elaboration of D. The (geo)metric structure underlying D depends firstly on the properties of the adopted dissimilarity measure d(·, ·), which has been used to construct D. For instance, if d(·, ·) obeys the common metric requirements, then D is said to be a metric DM, and obtaining a metric embedding space D is straightforward. However, when facing problems defined in complex input spaces, metric requirements, although highly exploitable from the pattern analysis viewpoint, can be also "unnecessary" for concrete pattern recognition needs [22,59]. Notwithstanding, different (usually computationally expensive) correction techniques can be applied to a non-metric DM [23,27].
Pȩkalska and Duin [58] presented three techniques to obtain D from the DM, D. The first one consists in using directly the rows of D as embedding vectors, obtaining the so-called DS embedding. This is the fastest way to represent the input data as vectors using the DM information. In addition, any common algebraic structure can be defined in the DS, making this approach very flexible. The second approach consists in performing a linear embedding of a (corrected) DM into an Euclidean space [58,Sec. 3.5]. Such methods aim to preserve the original DVs of D as much as possible, and, at the same time, to reduce the dimensionality of the embedding. An instance of these methods is the so-called pseudo-Euclidean embedding. Finally, they considered non-linear mappings, called spatial representations [58,Sec. 3.6]. Examples of those techniques include nonlinear multidimensional scaling methods.
The selection of the prototypes, R, plays of course an important role. A prototype selection technique operates equivalently to a (combinatorial) feature selection procedure, where the focus is on selecting a characterizing and essential subset of elements of S. On the other hand, well-known approaches of feature transformation/extraction can be applied directly on D, by exploiting statistical or geometrical information of the embedded input data.
The dissimilarity representation has found many applications so far, such as image analysis, query processing, and signature verification [3,13,35,52,56,73], as well as for the specific aim of designing labeled graph clustering and classification systems [9,48,66].

The Inexact Graph Matching Problem
A labeled graph (also called attributed graph) [43] is a tuple G = (V, E, µ, ν), where V is the finite set of vertices, E ⊆ V × V is the set of edges, µ : V → L V is the vertex labeling function, with L V denoting the set of vertex labels, and finally ν : E → L E is the edge labeling function, with L E denoting the set of edge labels. The topology of a graph enables the characterization of a pattern in terms of interacting elements. Moreover, the generality of both L V and L E allows to cover a broad range of real-world patterns. A (labeled) graph is said directed if the set E contains ordered pairs of vertices, i.e., (v i , v j ) ∈ E with v i , v j ∈ V; otherwise it is said undirected. In this paper, without loss of generality, we assume to deal with undirected labeled graphs only.
IGM algorithms, which can be idealized as nonnegative functions defined in the labeled graphs domain, G, face the difficult task of providing an algorithmic solution for quantifying the diversity among two given labeled graphs. Such a problem is non-trivial as a consequence of the difficulty of defining a proper and effective-in-practice (geo)metric structure on G, and because of the variety of possible vertex and edge label types. Classification, clustering, function approximation, and various mining and modeling problems defined in G essentially rely on an IGM mechanism, which is usually the most important component [12,16,29,33,43]. Livi and Rizzi [43] identify three mainstream approaches in the modern literature for dealing with the IGM problem: Graph Edit Distance (GED), Graph kernels, and Graph embedding. GED algorithms [26,55,76] search for the so-called minimum cost edit path among two graphs, which corresponds to the optimal sequence of basic edit operations (i.e., substitutions, deletions, and insertions) that transforms a graph into the other; solving such problem is however NP-Hard [55]. A particularly interesting class of GED algorithms is implemented in terms of the so-called assignment problem, which basically aims to find an assignment-association among the vertices of two input graphs, by successively inducing the edge operations [26]. Graph kernels [1,2,42,50,72] are proper positive definite kernel functions, which enable hence the applicability of the whole family of kernelized learning machines (e.g., the well-known support vector machine -SVM - [70]) in G.
Graph embedding algorithms [6,9,17,24,25,30,44,51,62,66,69], on the other hand, operate by explicitly developing an embedding space, D. The DV among two graphs is hence computed processing their vector representations in D, usually by either a geometric or information-theoretic perspective (e.g., divergence-based [25,38]). We distinguish two main categories of graph embedding algorithms: those that are defined in terms of a core IGM procedure working directly in G, and those that exploit a matrix representation of the graph to extract characterizing information. The former (e.g., see [6,9,66]) can process virtually any type of labeled graph, according to the capability of the adopted core matching algorithm. Such a core IGM procedure is used to explicitly define the embedding vectors. The latter, although those approaches are usually defined on a more sophisticated mathematical setting (e.g., see [24,46,62,69]), are constrained to process a restricted variety of labeled graphs, in which all the relevant information can be effectively condensed into a matrix representation of the graph, such as the (weighted) adjacency, transition, and Laplacian matrix.
The interested reader is referred to [12,33,43] and therein references for a comprehensive review of recent graph embedding techniques.

The Original ODSE Graph Classifier
The ODSE graph classification system [48] is founded on an explicit graph embedding mechanism that represents the input set of graphs S, n = |S|, using a suitable RS R, d = |R|, by initially computing the corresponding DM, D n×d . The configuration of the embedding vectors representing the input data in D is derived directly using the rows of D. The adopted IGM dissimilarity measure is the symmetric version of the GED procedure called best matching first, weighted using a three-weight edit scheme (TWEC). Although TWEC provides a heuristic solution to the GED problem, it has shown a good compromise between computational complexity (quadratic in the graph order), the number of characterizing parameters, and the reasonability of the dissimilarity evaluation [6,43,48]. TWEC operates a greedy assignment of the vertices among the two input graphs on the base of the corresponding labels dissimilarity; edge operations are induced accordingly.
ODSE synthesizes the classification model optimizing the DS representation by means of two dedicated operations, called compression and expansion. Both operations make use of the QRE estimator (Sec. 2.1.1) to quantify the information conveyed by the DM. Since the DVs fall into a continuous interval, the underlying distribution is assumed to be continuous as well. When the distribution is uniform, the entropy reaches its unique maximum value (see Eq. 5), which is used in the ODSE system to normalize the entropy evaluations in [0, 1]. To make this computation straightforward, TWEC has been re-normalized within the [0, 2] range, yielding a DVs extent equal to ∆ = 2.
Another important component of the ODSE graph classification system is the inner feature-based classifier, which operates directly in D; its own classification model is trained contextually to the ODSE synthesis. Such a classifier can be any well-known classification system, such as an MMN [67], or a kernelized SVM [70]. Test labeled graphs are classified by ODSE feeding the corresponding dissimilarity representation to the learned feature-based classifier, which assigns proper class labels to the test patterns. Fig. 1(a) and 1(b) give, respectively, the schematics of the ODSE embedding procedure and of the overall synthesis (i.e., optimization) of the ODSE classifier. The ODSE classification model is defined by the RS, R i , the TWEC parameters, P i , and the learned feature-based classifier, M i -see Fig. 1(b). During the synthesis stage additional parameters are optimized, which are fundamental to the determination of the best-performing ODSE classification model. Those parameters, which are synthetically denoted as Σ i in Fig.  1(a), are the kernel size σ used by the entropy estimator and the two entropy thresholds, τ c , τ e , which play a fundamental role in the compression and expansion operations, respectively. The ODSE model is synthesized by cross-validating the learned models on the training set S tr over a suitable validation set S vs . The global optimization is governed by a genetic algorithm, since the recognition rate, π i , guides among other factors the optimization, and its analytical definition with respect to (w.r.t.) the model parameters is not available in closed form. The genetic algorithm, although it does not assure convergence towards a global optimum, it is easily and effectively parallelizable, allowing to make use of multicore hardware/software implementations.

The ODSE Objective Function
All parameters characterizing the ODES model are arranged into codes, c i ∈ C. These include the two entropy thresholds {τ c , τ e } i , the kernel size of the entropy estimator, {σ} i , the weights of TWEC and any parameter of the vertex/edge label dissimilarity measures, all ranging in [0, 1]. Since each c i induces a specific RS, R i , the optimization problem that characterizes the ODSE synthesis consists in deriving the best-performing RSR:R = arg max The objective function in (12) is defined as a linear convex combination of two objectives, where η ∈ [0, 1] and Φ Ri (·) shorten the dissimilarity representation of an entire dataset using the compressedand-expanded RS instance, R i . The function f 1 (·, ·) evaluates the recognition rate achieved on a validation set S vs , while f 2 (·) accounts for the quality of the synthesized classification model. Specifically, where ς ∈ [0, 1], and Θ denotes the cost related to the number d i of prototypes. Accordingly, where ζ is the number of classes characterizing the classification problem at hand. The second term, namely Υ, captures the informativeness of the DM: We consider the entropy factor (16) in the ODSE objective function (13) to increase the spread-dispersion of the DVs, which in turn is assumed to magnify the separability of the classes. In fact, by assuming some cluster structure in the input data, graphs belonging to the same class would share similar characteristics w.r.t. R i .

The ODSE Compression Operation
The compression operation searches for subsets of the initial RS, R, that convey the same information w.r.t. S tr ; the initial RS is equal to the whole S tr in the original ODSE. In order to describe the mechanism behind the ODSE compression operation, we need to define when a given subset B ⊆ R of prototypes is considered compressible. Let D n×d be the DM corresponding to S tr and R, with n = |S tr | and d = |R|. Basically, B individuates a subset of k = |B| ≤ d columns of D. Let D[B] n×k be the filtered DM, i.e., the submatrix considering the prototypes in B only. We say that D[B] n×k is compressible if where 0 ≤ τ c ≤ 1 is the compression threshold, and H(·) estimates the QRE of the underlying joint distribution of D[B] n×k . In practice, the values of D[B] n×k are interpreted as k measurements of a ndimensional random vector; D k is the corresponding notation that we use throughout the paper to denote a sample of k random measurements elaborated from the DM. If the considered measurements are concentrated around a single n-dimensional support point, the estimated joint entropy is close to zero. This fact allows us to use Eq. 17 as a systematic compression rule, retaining only a single representative prototype graph of B.
The selection of the subsets B i , i = 1, 2, ..., p, for the compressibility evaluation is the first important algorithmic issue to be addressed. In the original ODSE [48], the subset selection has been performed by means of a randomized algorithm. The computational complexity of this approach is O d 3 n , which not scales adequately as the input size grows.

The ODSE Expansion Operation
The expansion is considered for each single R j ∈ ← − R, by analyzing the corresponding columns of the compressed DM, D n×d . By denoting with D n the sample containing the n DVs corresponding to the j -th column of D, we say that R j is expandable if where 0 ≤ τ e ≤ 1 is the expansion threshold. Practically, the information provided by the prototype is low if the n unidimensional measurements are concentrated around a single real-valued number. In such a case, the estimated entropy would be low, approaching zero as the underlying distribution becomes degenerate. Examples of such prototypes are outliers and prototype graphs that are equal in the same measure to all other graphs. Once an expandable R j is individuated through (18), R j is substituted by extracting ζ new graphs elaborated from S tr . Notably, those new graphs are derived by searching for recurrent subgraphs in a suitable subset of the training graphs.
Although the idea of trying to extract new features by searching for (recurrent) subgraphs is interesting, it is also very expensive in terms of computational complexity.

The Improved ODSE Graph Classifier
The improved ODSE system [45] follows the same learning scheme, but with the primary goal of a significant computational speed-up. The first variant, which is presented in Sec. 4.1, considers a simple yet fast RS initialization strategy, and a more advanced compression mechanism. The compression is grounded on the formal result discussed in Sec

Randomized Representation Set Initialization
The initial RS R, that is, the RS used during the synthesis, is defined by sampling the S tr according to a selection probability, p. The size of the initial RS is thus characterized by a binomial distribution, containing in average |S tr |p graphs, with variance |S tr |p(1 − p). Although such a selection criteria is linear in the training set size, it operates blindly and may cause an unbalanced selection of the prototypes considering the prior class distributions. However, such a simple sampling scheme is used only in cases where the available hardware cannot process the entire dataset at hand.

Compression by Clustering-based Subset Selection
The entropy measured by the QRE estimator (4) is used to determine the compressibility of a subset of prototypes, B. Since the entropy estimation is directly related to the DVs among the graphs of B, we design a subset selection strategy that aggregates the initial prototypes according to their mutual distance in the DS. Such subsets are assured to be compressible by definition, avoiding thus the computational burden involved in the entropy estimation.
We make use of the well-known Basic Sequential Algorithmic Scheme (BSAS) clustering algorithm (see the pseudo-code of Alg. 1) with the aim of grouping the n-dimensional dissimilarity column-vectors x j , j = 1, 2, ..., d, with (hyper)spheres, using the Euclidean metric d 2 (·, ·). The main reason behind the adoption of such a simple cluster generation rule is that it is much faster than other more sophisticated approaches [28], and it gives full control on the generated cluster geometry through a real-valued parameter, θ. Since θ constrains each cluster B l to have a maximum intra-cluster DV (i.e., a diameter) lower or equal to 2θ, we can deduce analytically the value of θ considering the particular instance of the kernel size σ c and the entropy threshold τ c used in Eq. 17. Accordingly, the following theorem allows us to determine the partition P (θ; τ c , σ c ) that contains only certainly-compressible clusters (see [45] for the proof).
Theorem 1. The compressible partition P (θ; τ c , σ c ) obtained on a training set S tr of n graphs, is derived setting: Algorithm 1 BSAS cluster generation rule.
Input: The ordered n input elements, a dissimilarity measure d(·, ·), the cluster radius θ, and the maximum number of allowed clusters Q Output: The partition P (θ) 1: for i = 1, 2, ..., n do 2: if P (θ) = ∅ then 3: Create a new cluster in P (θ) and define x i as the set representative 4: end if 5: Get the distance value D from the closest representative modeling a cluster of the current partition P (θ) 6: if D > θ AND |P (θ)| < Q then 8: Add a new cluster in P (θ) and define x i as the set representative 9: else 10: Add x i in the j -th cluster, and update the set representative element 11: end if 12: end for The optimization of τ c and σ c , together with the proof of Theorem 1, allows us to search for the best level of training set compression for the problem at hand. Alg. 2 shows the pseudo-code of the herein described compression operation. Since the ultimate aim of the compression is to aggregate prototypes that convey similar information w.r.t. S tr , we represent a cluster using the Minimum Sum Of Distances (MinSOD) technique [19]. In fact, the MinSOD cluster representation technique allows us to select a single representative element x k ∈ B k according to the following expression: Eventually, the p prototype graphs, B i , i = 1, 2, ..., p, corresponding to the p computed MinSOD elements in the DS, populate the compressed RS, Algorithm 2 Clustering-based compression algorithm.
Input: The initial set of prototype graphs R, |R| = d, the DM D n×d , the compression threshold τc, and the kernel size σc Output: The compressed set of prototype graphs ← − R 1: Configure BSAS setting Q = |R| and θ according to Eq. 19 2: Let X = (x 1 , x 2 , ..., x d ) be the (ordered) set of dissimilarity vectors elaborated from the columns of D 3: Execute the BSAS on X . Let P (θ; τc, σc) = {B 1 , B 2 , ..., Bp} be the obtained compressible partition 4: Compute the MinSOD element b i of each cluster B i , i = 1, 2, ..., p, according to Eq. 20. Retrieve from R the prototype graph B i corresponding to each dissimilarity vector b i 5: Define The search interval for the kernel size σ c can be effectively reduced according to the following range: .
A proof for (21) can be found in [45]. This restriction is important, since it allows to narrow the search interval for the kernel size σ c , which is theoretically defined in the whole extended real line.

Expansion based on Replacement with Maximum Dissimilar Graphs
The genetic algorithm evolves a population of models over the iterations t = 1, 2, ..., max. Let R 0 be defined as shown in Sec. 4.1.1, and let N t = S tr \ R t−1 be the set of unselected training graphs at iteration t ≥ 1. Finally, let ← − R t be the compressed RS of iteration t. The herein described expansion operation makes use of the elements of N t replacing in ← − R t those prototypes that do not discriminate the classes. The check for the expansion of a single prototype graph is still performed as described in Sec. 3.3. If the estimated entropy from the j-th column vector is lower than the expansion threshold, τ e , l new training graphs are selected from N t for each class, where l ≥ 1 is a user-defined factor. Those ζ × l new graphs are selected such that they result maximally dissimilar w.r.t. the j-th prototype under analysis. The new expansion procedure is outlined in [45,Alg. 2].
Since compression and expansion are evaluated considering different interpretations of the DM, we use two different kernel sizes: σ c and σ e , respectively.

Computational Complexity Analysis
The computational complexity is dictated by the execution of the genetic algorithm, O(I + EP × F ). I is the cost of the RS initialization, E is the number of (maximum) evolutions, P is the population size, and finally F is the cost related to a single fitness function evaluation. In this system variant, the initialization is linear in the training set size, O(I) = O(|S tr |); in average we select d = |S tr |p prototypes. The detailed cost related to the fitness function, O(F ), is articulated as the sum of the following costs: The first cost, F 1 , is related to the generation of the initial DM corresponding to S tr and the RS obtained through the initialization of Sec. 4.1.1; g is the cost of the adopted IGM procedure. F 2 is due to the compression operation which consists in a single BSAS execution, where C = d is the cache size of the MinSOD [19], Q = d , and e = n is the cost of a single Euclidean distance computation. F 3 is the cost characterizing the expansion operation; N is the cardinality of the set N t . This operation is repeated at most ← − d = | ← − R| times, with a quadratic entropy estimation cost in the training set size. F 4 is the cost related to the embedding of the DM, and F 5 is due to the classification of the validation set using a k -NN rule based classifier -this cost is updated according to the specific classifier. F 6 is the cost for the QRE over the compressed-and-expanded DM.
As it is possible to deduce from Eq. 22, the model synthesis is now characterized by a quadratic cost in the training set size, n, as well as in the RS size, d, while in the original ODSE it was (pseudo) cubic in both n and d.

ODSE with Mode Seeking Initialization
This ODSE version diverges from the one described in Sec. 4 from the fact that it does not implement any expansion operation, while it adopts a more elaborated initialization of the RS. Notably, the initialization of the RS is now part of the synthesis, since it depends on some of the parameters tuned during the optimization. Compression is still implemented as described in Sec. 4.1.2.
The initialization makes use of the Mode Seek (MS) algorithm [58], which is a well-known procedure that is able to individuate the modes of a distribution. For each class c i , i = 1, 2, ..., ζ, and considering a user-defined neighborhood size s ≥ 1, the algorithm proceeds as illustrated in [45,Alg. 3]. The elements of R found in this way are the estimated modes of the class distribution. The cardinality of R depends on the choice of s: the larger is s, the smaller is R. This approach is very appropriate when elements of the same class are distributed in different and heterogeneous clusters; the cluster representatives are the modes individuated by the MS algorithm. Moreover, the MS algorithm can be useful to filter out the outliers, since they are characterized by a low neighborhood density. The procedure is dependent on the user-defined neighborhood size s, which influences directly the outcome of the initialization. Additionally, since the neighborhood is defined in the graph domain, MS is also dependent on the weights characterizing TWEC. For this very reason, the initialization is now performed during the ODSE synthesis.
To limit the complexity of such an initialization, in the experiments we systematically assign small values to s, constraining the search in small neighborhoods. A possible side effect of this choice is that we can individuate an excessive number of prototypes/modes. This effect is however attenuated by the cascade execution of the compression algorithm (2).

Computational Complexity Analysis
The overall computational cost of the synthesis is now bounded by O(EP × F ). The two main steps of the fitness function (detailed in (23)) involve the execution of the MS algorithm followed by the compression algorithm. The F 1 cost refers to the MS algorithm. |c i | is the number of training data belonging to the i -th class. F 2 refers to the generation of the initial DM, constructed using S tr and the d ≤ |S tr | initial prototypes. F 3 is the cost of the compression operation, with Q = d . F 4 , F 5 , and F 6 are equivalent to the ones described in Sec. 4.2.1. The overall cost is dominated by the initialization stage (the F 1 cost), which is (pseudo) quadratic in the class size |c i |, and quadratic in the neighborhood size, s.

The Efficiency of the ODSE Clustering-based Compression
BSAS (Alg. 1) is characterized by a linear computational complexity. However, due to the sequential processing nature, the outcome is sensible to the data presentation order. In the following, we study the effect caused by the ordering of the input over the effectiveness of the CBC, by calculating what we called ODSE compression efficiency factor.
Let s = (x 1 , x 2 , ..., x n ) be the sequence of dissimilarity vectors describing the n prototypes in the DS, which are presented in input to Alg. 1. Let Ω(s) be the set of all permutations of the sequence s. We define the optimal compression ratio ρ * (s) for the sequence s as: where ← − R i is the compressed RS obtained by analyzing the prototypes arranged according to s i , and R is the uncompressed RS, i.e., the initial RS. Letρ(s) be the effective compression ratio, achieved by ODSE considering a generic ordering of s. The ratio describes the asymptotic efficiency of the ODSE compression as the initial RS size grows.
The proof can be found in Appendix A. An interpretation of the result of Theorem 2 is that, in the general case, the asymptotic efficiency of the implemented CBC varies within the [2/3, 1] range of the optimum compression.

ODSE with the MST-based Rényi Entropy Estimator
In the following, we contextualize the MST-RE estimation technique introduced in Sec. 2.1.2, as a component of the improved ODSE system presented in Sec. 4. Notably, we provide a theorem for determining the θ parameter of BSAS used in the compression operation (Alg. 2). In this case, we generate clusters according to the particular instance of τ c and of the γ parameter, since the kernel size parameter, σ c , is not present in the MST-based estimator. The γ parameter is optimized during the ODSE synthesis. While γ is defined in (0, d), where d is the dimensionality of the considered samples, we restrict the search interval to (0, U ], with U = 3 in the experiments. This technical choice is motivated by the fact that γ is used in Eq. 7 as exponent, and an excessively large value would easily cause overflow problems of the MST length variable floating-point representation. Theorem 3. Considering the instances of γ and τ c , the compressible partition P (θ; τ c , γ) is derived executing the BSAS algorithm on n = |S tr | training graphs by setting: The proof of the theorem can be found in Appendix B. Setting up θ according to Eq. 26 constrains the BSAS to generate certainly-compressible clusters only. Since τ c and γ are optimized during the synthesis of the classifier, the result of Theorem 3, likewise the one of Theorem 1, allows us to evaluate different levels of training set compression according to the overall system performance on the specific input data instance. It goes without saying that computational complexities discussed in the previous sections are updated by considering the cost of the MST-based estimator (see Eq. 11).

Experiments
The experimental evaluation of the improved ODSE classifier is organized in three subsections. First, in Sec. 6.1 we introduce the considered benchmarking datasets. In Sec. 6.2 we discuss the main experimental setting adopted in this paper. Finally, in Sec. 6.3 we show and discuss the results.

Datasets
The experimental evaluation is performed on the well-known IAM graph benchmarking databases [63]. The IAM benchmarking repository contains many different datasets representing patterns from various contexts, from images to biochemical compounds. In particular, we considered the Letter LOW (L-L), Letter MED (L-M), Letter HIGH (L-H), AIDS (AIDS), Proteins (P), GREC (G), Mutagenicity (M), and finally the Coil-Del (C-D) datasets. The first three are datasets of digitized characters modeled as labeled graphs, which are characterized by three different levels of noise. The AIDS, P, and M datasets represent biochemical networks, while G and C-D are images of various type. For the sake of brevity, we report only essential details in Tab. 1, referring the reader to Ref. [63] (and therein references) for a more in-depth discussion about the data. Moreover, since each dataset contains graphs characterized by different vertex and edge labels, we adopted the same vertex and edge dissimilarity measures described in [6,48].

Experimental Setting
The ODSE system version described in Sec. 4.1 is denoted as ODSE2v1, while the version described in Sec. 4.2 as ODSE2v2. Those two versions make use of the QRE estimator; the setting of the clustering algorithm parameter θ used during the compression is therefore performed according to the result of Theorem 1. By following the same algorithmic scheme, we consider two additional ODSE variants that differ only in the use of the MST-RE estimator. We denote those two variants as ODSE2v1-MST and ODSE2v2-MST. The setting of θ is hence performed according to the proof of Theorem 3. However, the MST-based estimator is conceived for high-dimensional measurements. As a consequence, in the ODSE2v1-MST system version we still use the QRE estimator in the expansion operation, since during the expansion we analyze unidimensional distributions of DVs. We considered two core classifiers operating in the DS. The first one is a k -nearest neighbors (k -NN) rule based classifier equipped with the Euclidean distance, testing three values of k: 1, 3, and 5. This first choice is dictated by the fact that, primarily, we need to compare the results herein presented directly with our previous works [45,48]. As a second classifier we used a fast MMN, which is trained with the ARC algorithm [67]. The four aforementioned ODSE variants (i.e., ODSE2v1, ODSEv2, ODSEv1-MST, and ODSE2v2-MST) are therefore replicated into additional four variants that are straightforwardly denoted as ODSE2v1-MMN, ODSEv2-MMN, ODSEv1-MST-MMN, and ODSE2v2-MST-MMN, meaning that we just use the neuro-fuzzy MMN on the embedding space, instead of the k -NN. Tab. 2 summarizes all considered ODSE configurations. Tests are executed setting the genetic algorithm with a (fixed) population size of 30 individuals, and performing a maximum of 40 evolutions for the synthesis; a check on the fitness value is however performed terminating the optimization if the fitness does not change for 15 evolutions. This setup has been chosen to allow a fair comparison with the previously obtained results [45,48]. The genetic algorithm performs roulette wheel selection, two-point crossover, and random mutation on the aforementioned codes c i , encoding the real-valued model parameters; in addition, the genetic algorithm implements an elitism strategy which automatically imports the fittest individual into the next population. In all considered configurations, we executed the system setting η = 0.9 and ς = 0.2 in Eq. 13 and 14, respectively. Moreover, the neighborhood size parameter s characterizing the MS algorithm, which affects both ODSE2v2 and ODSE2v2-MST versions, has been set as follows: 10 for the L-L, L-M, and L-H, 20 for AIDS, 2 for P, 8 for G, and finally 100 for either M and C-D. Note that these values has been defined according to the training dataset sizes and considering some preliminary tests. Each dataset has been processed five times using different random seeds, reporting hence the average test set classification accuracy together with its standard deviation. We report also the required average serial CPU time and the average RS size obtained after the synthesis. Tests have been conducted on a regular desktop machine with an Intel Core2 Quad CPU Q6600 at 2.40GHz and 4Gb of RAM; software is implemented in C++ on a Linux operating system (Ubuntu 12.04) by making use of the SPARE library [47]. Finally, the computing time is measured using the clock() routine of the standard ctime library. Table 2: Summary of the ODSE configurations evaluated in the experiments. The "Init" column refers to the RS initialization scheme, "Compression / Est." refers to the compression algorithm and adopted entropy estimator, "Expansion / Est." the same but for the expansion algorithm, and "Objective Func. (16)" refers to the entropy estimator adopted in Eq. 16. Finally, "FB Classifier" specifies the feature-based classifier operating in the DS.

Results and Discussion
All test set classification accuracy results have been collected in Tab. 3. Those include the results of three baseline reference systems and several state-of-the-art (SOA) classification systems based on graph embedding techniques. The table is divided in appropriate macro blocks to simplify the comparison of the results. The three reference systems are denoted as RPS+TWEC+k -NN, k -NN+TWEC, and RPS+TWEC+MMN. The first one performs a (class-independent) randomized selection of the training graphs to develop the dissimilarity representation of the input data. This system adopts the same TWEC used in ODSE and performs the classification in the DS by means of a k -NN classifier equipped with the Euclidean distance. The second one differs from the first system by using instead the MMN. Finally, the third reference system operates directly in G by means of a k -NN rule based classifier equipped with TWEC. In all cases, to obtain a fair comparison with ODSE, the configuration of the dissimilarity measures for the vertex/edge labels is consistent with the one adopted for ODSE. Additionally, k = 1, 3, and 5 is used in the k -NN rule, performing the TWEC parameters optimization (i.e., the weighting parameters in [0, 1]) by means of the same aforementioned genetic algorithm implementation. Therefore, also in this case the test set results must be intended as the average of five different runs (however we omit standard deviations for the sake of brevity). Tab. 3 presents the obtained test set classification accuracy results, while Tab. 4 gives the corresponding standard deviations. As expected, results obtained with the baseline reference systems are always inferior w.r.t. those of ODSE, regardless of the specific system configuration. Test set classification accuracy percentages obtained by ODSE2v1-MST and ODSE2v2-MST are comparable with those of ODSE2v1 and ODSE2v2, although we note a slightly general improvement for the first two variants. Results are also more stable varying the neighborhood size parameter, k, of the k -NN rule. Note that for difficult datasets, such as P and C-D, increasing the neighborhood size in the k -NN rule affects significantly the test set performances (i.e., results degrade considerably). Test set classification accuracy results obtained by means of the MMN operating in the DS are in general (slightly) inferior w.r.t. the ones obtained with the k -NN rule -setting k = 1. This results is not too "controversial" since the k -NN rule is a valuable classifier, especially in absence of noisy data. Since ODSE operates by searching for the best-performing DS for the data at hand, we may deduce that the embedding vectors are sufficiently well-organized w.r.t. the classes. Test set results on the first four datasets (i.e., L-L, L-M, L-H, and AIDS) denote an important improvement over a large part of the considered SOA systems. On the hand, results over the P, G, and M datasets are comparable w.r.t. those of the SOA systems. For all the considered ODSE configurations, we observe non convincing results on the C-D dataset; in this case results are comparable only with those of the reference systems (first block of Tab. 3). However, a rational reason explaining this fact is not emerged from the tests yet, requiring thus more future investigations. In general standard deviations (Tab. 4) are always reasonably small, denoting a reliable classifier regardless the particular ODSE variant.
We already demonstrated that in ODSE2 it is possible to appreciate a reduction of the theoretical computational complexity -i.e., from cubic to quadratic. To complement this study, we now move to the analysis of the effective computing time. The calculated serial CPU time is shown in Tab. 5, which includes the whole ODSE synthesis and test set evaluation. The ODSE variants based on the MST entropy estimator denote general improvements (with the only exception for the P and C-D datasets). This is especially observed by considering the first four datasets, in which the speed-up factor w.r.t. the original ODSE increases considerably. As expected, the speed-up factors obtained by means of the variants using the MMN are in general higher. In fact, the MMN properly synthesizes a classification model over the training data embedded into the DS, significantly reducing the computing time necessary for the evaluation of the test set. In Tab. 6 we report the CPU time specifically calculated for the test set evaluation, which demonstrates the superiority (from the CPU time viewpoint) of the variants operating with the MMN. This result might assume more importance in particular applications, especially in those where the synthesis of the classifier can be effectively performed only once in off-line mode and the classification model is employed to process high-rate data streams in real-time [68]. Now let us focus on the complexity of the synthesized classification models. The cardinality of the bestperforming RSs are shown in Tab. 5. It is possible to note that the cardinality are slightly bigger for those variants operating with MST-RE (especially in the first three datasets, i.e., L-L, L-M, and L-H). From this fact we deduce that, when configuring the CBC procedure with the MST-RE estimator, the ODSE classifier, in order to obtain good results in terms of test set accuracy, requires a more complex model w.r.t. the variants involving the QRE estimator. This behavior is however magnified by the setting of the objective function parameter η adopted in our tests, which biases the ODSE system towards the recognition rate performances. Notably, variants operating with the MMN develop considerable less costly classification models (see Tab. 7 and 8 for the details). This particular aspect becomes very significative in resource-constrained scenarios or when the input datasets are very big. The considerable reductions of the RS size strengthen the fact that the entropy estimation operates adequately in the dissimilarity representation context.

Conclusions and Future Directions
In this paper, we have presented different variants of the improved ODSE graph classification system. All the discussed variants are based on the characterization of the informativeness of the DM through the αorder Rényi entropy estimation. The first adopted estimator computes the QRE by means of a kernel-based density estimator, while the second one elaborates the α-order entropy from the length of the entropic MST constructed over the considered samples. The improved ODSE system has been designed by providing different strategies for the initialization, compression, as well as for the expansion operation of the RS. In particular, we conceived a fast CBC scheme, which allowed us to control directly the compression level of the data through the explicit setting of the cluster radius parameter. We provided formal proofs for the two considered estimation techniques. Those proofs enabled us to calculate the setting of the cluster radius analytically according to the ODSE model optimization procedure. We have studied also the asymptotic worst-case efficiency of the CBC scheme implemented by means of the sequential cluster generation rule (BSAS).
Experimental evaluations and comparisons with many state-of-the-art systems have been performed on Table 3: Test set classification accuracy results of the reference systems, several SOA graph embedding based classification systems, and the considered ODSE variants (grayed lines denote novel results introduced in this paper). The "-" sign means that the result is not available to our knowledge. In bold face we report the best-performing system for each dataset.      7  82  126  801  770  ODSE2v1-MM  136  192  144  6  190  163  563  555  ODSE2v2-MM  197  546  80  2  93  115  815  740  ODSE2v1-MST  597  595  597  6  198  283  687  618  ODSE2v2-MST  551  574  447  61  122  129  813  775  ODSE2v1-MST-MMN  600  606  500  5  190  184  424  549  ODSE2v2-MST-MMN  550  580  411  61  93  115 456 733  -MMN  15  39  34  5  43  27  164  357  ODSE2v2-MMN  15  28  41  4  48  28  159  368  ODSE2v1-MST-MMN  15  27  38  3  48  28  168  348  ODSE2v2-MST-MMN  15  27  34  4  43  27  175  365 well-known benchmarking datasets of labeled graphs (IAM database). We considered two different featurebased classifiers operating in the DS: the k -NN classifier equipped with the Euclidean distance and a neurofuzzy MMN trained with the ARC algorithm. Overall, the variants adopting the MST-based estimator resulted to be faster but less parsimonious for what concerns the synthesized ODSE model (i.e., the cardinality of the best-performing RS). The use of the k -NN rule (with k = 1) yielded slightly better test set accuracy results w.r.t. the MMN, while however in the latter case we have observed significative differences in term of (serial) CPU computing time, especially focusing on the test set processing stage. The test set classification accuracy results confirm the effectiveness of the ODSE classifier w.r.t. the state-of-the-art standards. Moreover, the significative CPU time improvements w.r.t. the original ODSE version, and the highly parallelizable global optimization scheme based on a genetic algorithm, bring the ODSE graph classifier one step closer towards the applicability to bigger labeled graphs and larger datasets. The vector representation of the input graphs have been obtained directly using the DM rows. Such a choice, while it is known to be effective, has been mainly dictated by the computing time requirements of the system. It is worth analyzing the performances of ODSE also when the embedding space is obtained by the (non)linear embedding of the (corrected) DM. Future experiments could include testing other core IGM procedures, different α-order Rényi entropy estimators, and additional feature-based classifiers, such as SVM or other neural network types. Finally, the design of the ODSE system is generic enough to be adapted quite straightforwardly to other input domains. We will therefore evaluate the possibility to develop an ODSE-like classification system for sequences of generic objects.

A Proof of Theorem 2
Proof. We focus on the worst-case scenario for ξ, giving thus a lower bound for the efficiency (25). Let s[i] = x i denote the i -th element of the sequence s, i.e., the i -th dissimilarity vector corresponding to the prototype graph R i ∈ R. Let s * be the best ordering for s, i.e., s * = arg max si∈Ω(s) ρ(s i ).
Let us assume the case in which the Euclidean distance among any pair of vectors in s is given by where θ is the adopted cluster radius during the ODSE compression. It is straightforward to understand that this is the worst-case scenario for the compression purpose in the sequential clustering setting. In fact, each vector x i in the sequence s has a distance with its predecessor/successor equal to the maximum cluster radius θ. As a consequence, there is still the possibility to compress the vectors, but it is however strictly dependent on the specific ordering of s. First of all, it is important to note that only three elements of s can be contained into a single cluster, due to the distances assumed in (28). In fact, any three consecutive elements of the sequence s would form a cluster with a diameter equal to 2θ. Therefore, considering the sequential association rule shown in Alg. 1, and setting Q = n, the best possible ordering s * is the one that preserves a distance equal to θ for any two adjacent elements of s, achieving a compression ratio of: ρ * (s) = n/ n/3 .
The worst possible ordering, instead, yields n/ n/2 , which can be achieved (for instance assuming n to be odd) when considering the following ordering s i w.r.t. the optimal s * : s i [j] = s * [(2j mod n) + 1], j = 1, 2, ..., n.
In this case, Alg. 1 would generate exactly n/2 (31) clusters, corresponding to the first n/2 elements of the sequence s i , since every pair of consecutive elements in s i is at a distance of exactly 2θ. Therefore, n/2 is the maximum number of clusters that can be generated by considering the distances assumed in (28). Combining Eq. 29 and 31, we obtain for a given s, n/ n/2 ≤ρ(s) ≤ ρ * (s) = n/ n/3 , which allows us to claim that the worst-case efficiency of the ODSE compression varies according to the following ratio:ρ (s)/ρ * (s) = n n/2 × n/3 n = n/3 n/2 .
Taking the limit for n → ∞ in Eq. 33 gives us the claim.

B Proof of Theorem 3
Proof. Let us focus the analysis on a single cluster B ∈ P (θ; τ c , γ), containing k = |B| prototypes within a training set of n graphs. Let us remind that the cluster radius and diameter are, respectively, θ and 2θ in the spherical cluster case. Therefore, we can obtain an upper bound for the MST length factor (7), considering that (all) the corresponding MST, T , of the complete graph generated from the k measurements has k − 1 edges with weights equal to 2θ. Specifically, In the following, we evaluate β(L γ (θ), n) exactly as defined in Eq. 10, considering n dimensions (note that β(L γ (θ), n) is shortened as β). Eq. 34 allows us to derive the following upper bound for the MST-based entropy estimator (8) = n γ [ln(k − 1) + γ ln(2θ) − ln(k α ) − ln(β)] .