Article

Novel Features and Neighborhood Complexity Measures for Multiclass Classification of Hybrid Data

by Francisco J. Camacho-Urriolagoitia 1, Yenny Villuendas-Rey 2,*, Cornelio Yáñez-Márquez 1 and Miltiadis Lytras 3,4

1 Centro de Investigación en Computación del Instituto Politécnico Nacional, Juan de Dios Bátiz s/n, Gustavo A. Madero, Mexico City 07738, Mexico
2 Centro de Innovación y Desarrollo Tecnológico en Cómputo del Instituto Politécnico Nacional, Juan de Dios Bátiz s/n, Gustavo A. Madero, Mexico City 07700, Mexico
3 School of Business, Deree—The American College of Greece, 6 Gravias Street, 15342 Aghia Paraskevi, Athens, Greece
4 College of Engineering, Effat University, Jeddah 21478, Saudi Arabia
* Author to whom correspondence should be addressed.
Sustainability 2023, 15(3), 1995; https://doi.org/10.3390/su15031995
Submission received: 19 December 2022 / Revised: 10 January 2023 / Accepted: 17 January 2023 / Published: 20 January 2023
(This article belongs to the Special Issue Knowledge Management in Healthcare)

Abstract

The present capabilities for collecting and storing all kinds of data exceed the collective ability to analyze, summarize, and extract knowledge from them. Knowledge management aims to automatically organize a systematic process of learning. Most meta-learning strategies are based on determining data characteristics, usually by computing data complexity measures. Such measures describe data characteristics related to size, shape, density, and other factors. However, most of the data complexity measures in the literature assume the classification problem is binary (just two decision classes) and that the data is numeric and has no missing values. The main contribution of this paper is that we extend four data complexity measures to overcome these drawbacks and characterize multiclass, hybrid, and incomplete supervised data. We change the formulation of the Feature-based measures while maintaining the essence of the original measures, and we use a maximum similarity graph-based approach for designing the Neighborhood measures. We also use ordered weighted averaging operators to avoid biases in the proposed measures. We included the proposed measures in the EPIC software for computational availability, and we computed the measures for publicly available multiclass, hybrid, and incomplete datasets. In addition, the performance of the proposed measures was analyzed, and we can confirm that they solve some of the biases of previous measures and natively handle mixed, incomplete, and multiclass data without any preprocessing.

1. Introduction

Several disciplines, such as Pattern Recognition, Machine Learning, Computational Intelligence, and Artificial Intelligence share an interest in automatic knowledge management strategies [1]. The latter is becoming an active research area useful for several daily-life aspects such as environmental concerns (i.e., clothing sustainability [2], fault detection in the subsea [3], and pollution analysis [4]), education (teacher training [5] and other educational applications [6]), and health (disease prediction [7], secure Internet of Things in healthcare [8] and other healthcare challenges [9]), among others.
There are numerous supervised classification algorithms, such as Neighborhood-based classifiers [10], Decision trees [11], Neural networks [12], Support vector machines [13], Associative classifiers [14], and Logical-combinatorial classifiers [15]. However, because of the No Free Lunch theorems [16], no algorithm will outperform all others for all problems and performance measures. That is why meta-learning, as a knowledge management technique, is the focus of several research efforts [17,18,19].
Most meta-learning strategies are based on determining data characteristics, usually by computing data complexity measures [20,21]. Such measures aim at describing data characteristics [22,23,24] related to size, shape, density, and others. However, the majority of the data complexity measures in the literature assume the classification problem is binary (just two decision classes), and that the data is numeric and has no missing values.
Unfortunately, in many real-life applications, such assumptions are not fulfilled. Real data is often hybrid (described by both numeric and categorical attributes) and can present an absence of information. In addition, several problems have multiple possible outcomes or decisions to make; they have multiple decision classes.
As stated before, data complexity measures [20] are usually defined for numeric, complete, and binary classification problems. Therefore, our aim is to extend the measures for characterizing multiclass, hybrid, and incomplete supervised data. The main contributions of this paper are the following:
  • We extend four data complexity measures for the multiclass classification scenario and for dealing with hybrid and incomplete data.
  • We include the proposed four measures in the EPIC software [25,26] for computational availability.
  • We compute the proposed measures for publicly available multiclass hybrid and incomplete datasets.
This paper is organized as follows: Section 2 reviews some of the related works on data complexity measures. Section 3 introduces the extended data complexity measures, and Section 4 shows some of the properties of the proposed measures, as well as their computation over publicly available datasets. Finally, we present some conclusions and future works.

2. Related Works

This section reviews the existing data complexity measures of the Feature-based, Linearity, and Neighborhood taxonomies for dealing with single-label supervised classification. Section 2.1 offers the preliminary concepts. Section 2.2 details Feature-based complexity measures. Section 2.3 explains Linearity measures, and Section 2.4 covers Neighborhood measures.

2.1. Preliminaries

Let $U$ be a universe of instances, described by a set of features $A = \{A_1, \ldots, A_m\}$. Each attribute $A_i$ has a definition domain, which can be Boolean, numeric, or categorical, and can have missing values (denoted by ?). Let $x \in U$ be an instance; its value in the $i$-th feature is denoted by $x[i]$. Let us also have a decision attribute (label) $d$ with a set of decision classes $\{d_1, \ldots, d_c\}$. The true label of an instance $x \in U$ is denoted by $class(x)$. If the decision attribute $d$ has two values, we have a binary single-label classification problem; if $d$ has more than two values, we have a multiclass single-label classification problem (Figure 1).
Since 2002, there has been an interest in assessing the complexity of supervised data for automatic classification problems [23]. Several data complexity measures have been introduced [27].
An established taxonomy of data complexity measures is the following [27]:
  • Feature-based measures describe how useful the features are to separate the classes.
  • Linearity measures try to determine the extent of linear separation between the classes.
  • Neighborhood measures characterize class overlapping in neighborhoods.
  • Network measures focus on the structural information of the data.
  • Dimensionality measures estimate data sparsity.
  • Class imbalance measures consider the number of instances in the classes.
In the following analysis, we survey some of the most used data complexity measures of the Feature-based, Linearity, and Neighborhood taxonomies, describe their advantages and disadvantages, and assess their potential to be extended for multiclass classification.

2.2. Feature-Based Measures

Feature-based measures consider the individual power of the features regarding the classification task. They are usually easy to compute, but they do not consider feature dependencies.

2.2.1. Maximum Fisher’s Discriminant Ratio (F1)

F1 is a measure devoted to assessing the discriminant power of individual features. It is given by:
$$F1 = \frac{1}{1 + \max_{i=1}^{m} r_i}$$
where r i is a discriminant ratio for the i-th feature. Several formulations of r i exist in the literature, all considering means and standard deviation of classes. Let μ 1 and μ 2 be the means of the classes, and let σ 1 and σ 2 be the standard deviations. The original formulation by Ho and Basu [23] is:
$$r_i = \frac{(\mu_1 - \mu_2)^2}{\sigma_1^2 + \sigma_2^2}$$
The F1 measure has several drawbacks. It assumes data is numeric and complete (no missing values) and that the classification problem is binary. In addition, it assumes that the linear boundary is perpendicular to one of the feature axes. If features are separable but with an oblique line, it does not capture such information.
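As an illustration, the two-class F1 can be sketched in a few lines of Python. This is a minimal sketch under our own data layout (one list of values per feature and class); the function names are illustrative, not from the EPIC software.

```python
def fisher_ratio(vals1, vals2):
    """r_i = (mu1 - mu2)^2 / (sigma1^2 + sigma2^2) for a single feature."""
    mean = lambda v: sum(v) / len(v)
    var = lambda v: sum((x - mean(v)) ** 2 for x in v) / len(v)
    return (mean(vals1) - mean(vals2)) ** 2 / (var(vals1) + var(vals2))

def f1_measure(features_c1, features_c2):
    """F1 = 1 / (1 + max_i r_i); lower values indicate an easier problem."""
    return 1.0 / (1.0 + max(fisher_ratio(a, b)
                            for a, b in zip(features_c1, features_c2)))
```

Well-separated classes drive some $r_i$ up and F1 toward zero; identical class distributions give $r_i = 0$ and F1 = 1.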

2.2.2. Volume of the Overlapping Region (F2)

The F2 measure estimates the volume of the overlapping region between the two classes:
$$F2 = \prod_{i=1}^{m} \frac{\mathrm{overlap}(f_i)}{\mathrm{range}(f_i)} = \prod_{i=1}^{m} \frac{\max\{0,\ \mathrm{minmax}(f_i) - \mathrm{maxmin}(f_i)\}}{\mathrm{maxmax}(f_i) - \mathrm{minmin}(f_i)}$$
$$\begin{aligned}
\mathrm{minmax}(f_i) &= \min\big(\max(f_i^{c_1}), \max(f_i^{c_2})\big)\\
\mathrm{maxmin}(f_i) &= \max\big(\min(f_i^{c_1}), \min(f_i^{c_2})\big)\\
\mathrm{maxmax}(f_i) &= \max\big(\max(f_i^{c_1}), \max(f_i^{c_2})\big)\\
\mathrm{minmin}(f_i) &= \min\big(\min(f_i^{c_1}), \min(f_i^{c_2})\big)
\end{aligned}$$
where $\max(f_i^{c_j})$ and $\min(f_i^{c_j})$ are the maximum and minimum values of feature $f_i$ for class $c_j \in \{c_1, c_2\}$, respectively.
The F2 measure has the same disadvantages as F1 but also fails in classes where one or more features occupy different intervals. It is also susceptible to noise in the data. To overcome some of these drawbacks, Cummins introduced a modified version (F2’) [28].
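A minimal sketch of the original two-class F2: the helpers mirror the minmax/maxmin definitions above, and the data layout (one value list per feature and class) is our own illustration.

```python
def f2_measure(features_c1, features_c2):
    """Two-class F2: product of normalized per-feature overlap lengths."""
    product = 1.0
    for a, b in zip(features_c1, features_c2):
        minmax = min(max(a), max(b))  # lower of the two class maxima
        maxmin = max(min(a), min(b))  # higher of the two class minima
        overlap = max(0.0, minmax - maxmin)
        rng = max(max(a), max(b)) - min(min(a), min(b))
        product *= overlap / rng
    return product
```

A single non-overlapping feature zeroes the whole product, which is exactly the failure mode the text describes.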

2.2.3. Maximum Individual Feature Efficiency (F3)

F3 measures the capability of individual features for class separation as follows:
$$F3 = \min_{i=1}^{m} \frac{n_o(A_i)}{n}$$
where $n_o(A_i)$ is the number of overlapping instances according to $A_i$, $I$ is the indicator function, and $n = |U|$ is the number of instances:
$$n_o(A_i) = \sum_{x \in U} I\big(x[i] > \mathrm{maxmin}(A_i) \wedge x[i] < \mathrm{minmax}(A_i)\big)$$
In the same manner as F1 and F2, the F3 measure assumes data is numeric, complete, and has only two decision classes.
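F3 can be sketched as follows; again this is a two-class, numeric-only illustration with our own data layout, counting instances that fall strictly inside each feature's overlap interval.

```python
def f3_measure(features_c1, features_c2):
    """F3 = min_i n_o(A_i) / n, with n_o counting instances strictly
    inside the two-class overlap interval of feature A_i."""
    n = len(features_c1[0]) + len(features_c2[0])
    best = None
    for a, b in zip(features_c1, features_c2):
        minmax = min(max(a), max(b))  # lower of the two class maxima
        maxmin = max(min(a), min(b))  # higher of the two class minima
        n_o = sum(1 for v in a + b if maxmin < v < minmax)
        best = n_o if best is None else min(best, n_o)
    return best / n
```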

2.3. Linearity Measures

Linearity measures aim at determining if it is possible to separate the classes by a hyperplane under the hypothesis that a linearly separable problem is simpler than a non-linear one.

2.3.1. Sum of the Error Distance by Linear Programming (L1)

This measure computes the sum of the distances of the instances incorrectly classified as:
$$L1 = 1 - \frac{1}{1 + \mathrm{SumErrorDist}} = \frac{\mathrm{SumErrorDist}}{1 + \mathrm{SumErrorDist}}$$
where $\mathrm{SumErrorDist}$ is the sum of the error distances $\varepsilon_j$ of the incorrectly classified instances; the $\varepsilon_j$ values are determined by the optimization process of a Support Vector Machine (SVM), which is used to compute the desired hyperplane.

2.3.2. Error Rate of Linear Classifier (L2)

L2 computes the error ratio of the linear SVM classifier. Let $h(x)$ be the linear classifier; L2 is computed as:
$$L2 = \frac{1}{n}\sum_{i=1}^{n} I\big(h(x_i) \neq y_i\big)$$
The main issues with the L1 and L2 measures are:
  • They assume two classes, as they rely on an SVM;
  • They assume data is numeric and complete;
  • They use a fixed classifier (SVM) to establish the linearity of the data.

2.3.3. Non-linearity of Linear Classifier (L3)

The L3 measure starts by obtaining an interpolated set of instances of cardinality l . Let h T ( x ) be the linear classifier obtained using the training data and x i be the interpolated instances; L3 is as follows:
$$L3 = \frac{1}{l} \sum_{i=1}^{l} I\big(h_T(x_i) \neq class(x_i)\big)$$
According to [27], L3 is sensitive to how the data from a class are distributed in the border regions and also on how much the convex hulls that delimit the classes overlap. In addition, L3 shares the disadvantages of L1 and L2.

2.4. Neighborhood Measures

The data complexity measures based on neighborhoods aim to capture the structure of the classes and/or the shape of the decision boundaries. Such measures use distances or dissimilarity functions between instances and, therefore, are bounded by O ( n 2 ) .

2.4.1. Fraction of Borderline Points (N1)

A Minimum Spanning Tree is built and denoted as M S T = ( U , E ) . Each vertex corresponds to an instance, and the edges are weighted according to the distance between them. N1 is obtained by calculating the percentage of instances in vertices ( x i , x j ) U × U involved in edges that connect instances of opposite classes in the generated tree, as:
$$N1 = \frac{1}{n} \sum_{i=1}^{n} I\big((x_i, x_j) \in E \wedge class(x_i) \neq class(x_j)\big)$$
The N1 measure is not deterministic, due to the fact that there can be multiple MSTs for the same dataset. However, its formulation is suitable for multiclass data and allows the use of different functions for computing the dissimilarities in hybrid and incomplete data.
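A sketch of N1 using Prim's algorithm over the complete graph; the Euclidean distance is only one possible choice, since, as noted above, the measure admits other dissimilarity functions.

```python
# N1: fraction of instances incident to MST edges joining different classes.
def n1_measure(points, labels):
    n = len(points)

    def dist(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b)) ** 0.5

    # Prim's algorithm: best[j] = (distance to tree, tree vertex realizing it)
    best = {j: (dist(points[0], points[j]), 0) for j in range(1, n)}
    edges = []
    while best:
        j = min(best, key=lambda k: best[k][0])
        _, i = best.pop(j)
        edges.append((i, j))
        for k in best:
            dk = dist(points[j], points[k])
            if dk < best[k][0]:
                best[k] = (dk, j)
    # Collect vertices touched by between-class edges
    borderline = {v for i, j in edges if labels[i] != labels[j] for v in (i, j)}
    return len(borderline) / n
```

Ties between candidate edges are broken arbitrarily here, which reflects the non-determinism of N1 mentioned above.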

2.4.2. Ratio of Intra/Extra Class Nearest Neighbor Distance (N2)

This measure calculates the ratio of two sums: (i) the sum of the distances between each example and its nearest neighbor of the same class (intra-class); and (ii) the sum of the distances between each example and its nearest neighbor of another class (extra-class):
$$intra\_extra = \frac{\sum_{i=1}^{n} d(x_i, same\_NN(x_i))}{\sum_{i=1}^{n} d(x_i, diff\_NN(x_i))}$$
where $d(x_i, same\_NN(x_i))$ is the distance of $x_i$ to its nearest neighbor of the same class and $d(x_i, diff\_NN(x_i))$ is the distance of $x_i$ to its nearest enemy. N2 is computed as:
$$N2 = 1 - \frac{1}{1 + intra\_extra} = \frac{intra\_extra}{1 + intra\_extra}$$
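N2 can be sketched as follows; the Euclidean distance is used purely for illustration.

```python
def n2_measure(points, labels):
    """N2 = intra_extra / (1 + intra_extra), lower for compact classes."""
    def dist(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b)) ** 0.5

    n = len(points)
    intra = extra = 0.0
    for i in range(n):
        # distance to the nearest neighbor of the same class
        intra += min(dist(points[i], points[j]) for j in range(n)
                     if j != i and labels[j] == labels[i])
        # distance to the nearest enemy (instance of another class)
        extra += min(dist(points[i], points[j]) for j in range(n)
                     if labels[j] != labels[i])
    ratio = intra / extra
    return ratio / (1.0 + ratio)  # equals 1 - 1/(1 + intra_extra)
```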

2.4.3. Error Rate of the Nearest Neighbor Classifier (N3)

This measure calculates the global error of the nearest neighbor classifier as:
$$N3 = \frac{1}{n}\sum_{i=1}^{n} I\big(NN(x_i) \neq class(x_i)\big)$$
The main issue with this measure is that it is biased towards the majority class and is unsuitable for imbalanced data.

2.4.4. Non-Linearity of the Nearest Neighbor Classifier (N4)

This measure is similar to L3 but uses NN instead of the linear classifier, as:
$$N4 = \frac{1}{l} \sum_{i=1}^{l} I\big(NN_T(x_i) \neq class(x_i)\big)$$
where l is the number of interpolated instances.
The main problem with this measure is that it is unsuitable for hybrid and incomplete data (due to the data interpolation it requires).

2.4.5. Fractions of Hyperspheres Covering Data (T1)

The T1 measure aims to capture the topological structure of the data by recursively computing the hyperspheres needed to cover the data, without instances of different classes in the same hypersphere, as:
$$T1 = \frac{\#Hyperspheres(T)}{n}$$
where # H y p e r s p h e r e s ( T ) is the number of hyperspheres needed to cover the data.
The T1 measure assumes data is numeric and complete and has a recursive procedure for computing the hyperspheres with high computational complexity.

2.4.6. Local Set Average Cardinality (LSC)

LSC is a measure of local characterization: for each instance, it considers the number of same-class instances that lie closer to it than its nearest enemy (its local set), and then averages these cardinalities over the dataset:
$$LSC = 1 - \frac{1}{n^2} \sum_{i=1}^{n} |LS(x_i)|, \qquad LS(x_i) = \{x_j \mid d(x_i, x_j) < d(x_i, diff\_NN(x_i))\}$$
This measure can complement N1 and L1 by also revealing the narrowness of the between-class margin [27]. It handles multiclass data, and its ability to handle hybrid and incomplete data will depend on the dissimilarity function used.
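A sketch of LSC, reading the formula above literally: each local set then contains the instance itself (at distance 0). The Euclidean distance is used for illustration.

```python
def lsc_measure(points, labels):
    """LSC = 1 - (1/n^2) * sum_i |LS(x_i)|."""
    n = len(points)

    def dist(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b)) ** 0.5

    total = 0
    for i in range(n):
        # distance to the nearest enemy of x_i
        enemy = min(dist(points[i], points[j]) for j in range(n)
                    if labels[j] != labels[i])
        # size of the local set: everything strictly closer than the enemy
        total += sum(1 for j in range(n) if dist(points[i], points[j]) < enemy)
    return 1.0 - total / n ** 2
```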
Table 1 summarizes the reviewed data complexity measures, considering their proportion, boundaries, computational complexity, and ability to deal with multiclass, hybrid, and incomplete data.
As shown, neither the Feature-based nor the Linearity measures handle multiclass, hybrid, or incomplete data. On the other hand, some Neighborhood measures are applicable to such scenarios when appropriate dissimilarity functions are used, while others (N3, N4, and T1) are not.
It is important to mention that with the increase in the applicability of machine-learning techniques, the need for validating such techniques has also increased [29]. To such end, formal methods have been developed [30]. Data complexity measures are between two phases of the machine learning cycle: data understanding and data preparation (Figure 2). Depending on the complexity assessment, data preparation techniques can be used more wisely.
Unfortunately, to the best of our knowledge, no formal method has been developed to deal with the task of assessing data complexity, and it remains a gap in the scientific research on the topic.

3. Proposed Approach and Results

Our hypothesis is that extending data complexity measures for the multiclass scenario, having hybrid and incomplete data, is possible (Figure 3).
In the following, we describe the proposed extended measures. It is important to mention that all proposed measures are able to deal with multiclass, hybrid, and incomplete data.

3.1. Extended Feature-Based Measures

3.1.1. Extended Maximum Fisher’s Discriminant Ratio (F1_ext)

We wanted to maintain the idea behind F1 as a way of assessing the discriminant power of individual features. It is given by:
$$F1\_ext = \frac{1}{1 + \max_{i=1}^{m} r_i}$$
$$r_{i\_num} = \frac{\sum_{j=1}^{c} \big|\{x \in U \mid class(x) = j\}\big|\,(\mu_{ji} - \mu_i)^2}{\sum_{j=1}^{c} \sum_{x \in U \mid class(x) = j} (x[i] - \mu_{ji})^2}$$
$$r_{i\_cat} = \frac{|overlap\_cat(A_i)|}{|range\_cat(A_i)|}$$
$$overlap\_cat(A_i) = \{v : \exists\, x, y \in U,\ v = x[i] = y[i] \neq\ ? \wedge class(x) \neq class(y)\}$$
$$range\_cat(A_i) = \{v : \exists\, x \in U,\ x[i] = v \wedge v \neq\ ?\}$$
For numerical features, μ i is the mean of feature A i and μ j i is the mean of feature A i considering only the instances in { x U | c l a s s ( x ) = j } . Both means are computed, disregarding the instances with missing values (?).
Our definition of overlap for categorical feature values extends [28] by enumerating values appearing in different classes and disregarding missing values, and our definition of range considers all possible values of feature Ai. Using an efficient implementation, the computational complexity of our proposal is bounded by O ( m n ) .
The proposed measure is able to deal with multiclass hybrid and incomplete data, presenting an advance for data complexity analysis. However, as its predecessor F1, for numerical data, our measure assumes that the linear boundary is unique and perpendicular to one of the feature axes. If a feature is separable but with more than one line, it does not capture such information (Figure 4).
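A sketch of F1_ext as we read the definitions above. The data layout (one value list per feature, a per-feature categorical flag, and "?" for missing values) is our own illustration, not the EPIC implementation.

```python
MISSING = "?"

def r_num(values, labels):
    """Multiclass discriminant ratio of one numeric feature, skipping '?'."""
    pairs = [(v, c) for v, c in zip(values, labels) if v != MISSING]
    mu = sum(v for v, _ in pairs) / len(pairs)          # global mean
    between = within = 0.0
    for c in set(c for _, c in pairs):
        vals = [v for v, cc in pairs if cc == c]
        mu_c = sum(vals) / len(vals)                    # per-class mean
        between += len(vals) * (mu_c - mu) ** 2
        within += sum((v - mu_c) ** 2 for v in vals)
    return between / within

def r_cat(values, labels):
    """|overlap_cat| / |range_cat|: share of values seen in several classes."""
    domain = set(v for v in values if v != MISSING)
    overlap = {v for v in domain
               if len({c for vv, c in zip(values, labels) if vv == v}) > 1}
    return len(overlap) / len(domain)

def f1_ext(features, labels, is_cat):
    r = [r_cat(f, labels) if cat else r_num(f, labels)
         for f, cat in zip(features, is_cat)]
    return 1.0 / (1.0 + max(r))
```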

3.1.2. Extended Volume of the Overlapping Region (F2_ext)

We extended the original F2 measure by using different overlapping ranges for numeric and categorical data. For numeric data, our formulation is close to the original but with extensions.
$$F2\_ext = \prod_{i=1}^{m} r_i$$
where:
$$r_i = \begin{cases} r_{i\_cat} & \text{if feature } i \text{ is categorical} \\ r_{i\_num} & \text{if feature } i \text{ is numeric} \end{cases}$$
$$r_{i\_num} = \frac{|overlap\_num(A_i)|}{|range\_num(A_i)|} = \frac{\max\{0,\ \mathrm{minmax}(A_i) - \mathrm{maxmin}(A_i)\}}{\mathrm{maxmax}(A_i) - \mathrm{minmin}(A_i)}$$
$$\begin{aligned}
\mathrm{minmax}(A_i) &= \min_{j=1..c}\ \max_{x \in U} \{x[i] \neq\ ? \mid class(x) = j\}\\
\mathrm{maxmin}(A_i) &= \max_{j=1..c}\ \min_{x \in U} \{x[i] \neq\ ? \mid class(x) = j\}\\
\mathrm{maxmax}(A_i) &= \max_{j=1..c}\ \max_{x \in U} \{x[i] \neq\ ? \mid class(x) = j\}\\
\mathrm{minmin}(A_i) &= \min_{j=1..c}\ \min_{x \in U} \{x[i] \neq\ ? \mid class(x) = j\}
\end{aligned}$$
$$r_{i\_cat} = \frac{|overlap\_cat(A_i)|}{|range\_cat(A_i)|}$$
Both minimum and maximum values are computed, disregarding the instances with missing values (?). Our formulation solves the problem of dealing with multiple classes, as well as with missing and hybrid data. Using an efficient implementation, the computational complexity of our proposal is bounded by O ( m n ) .
As pointed out by Lorena [27], the F2 value can become very small depending on the number of operands in Equation (17); that is, it is highly dependent on the number of features a dataset has. Our extension does not avoid this situation. It has the same problems as F1_ext (Figure 5).

3.1.3. Extended Maximum Individual Feature Efficiency (F3_ext)

Our proposal draws on the extension of the F3 measure in [28], while maintaining the idea of [27] that the measure should provide lower values for simpler problems. Our proposal is as follows:
$$F3\_ext = \min_{i=1}^{m} \frac{n_o(A_i)}{n}$$
where $n_o(A_i)$ is the number of overlapping instances according to $A_i$:
$$n_o(A_i) = \begin{cases} |overlap\_cat(A_i)| & \text{if feature } i \text{ is categorical} \\ intersect(A_i) & \text{if feature } i \text{ is numeric} \end{cases}$$
$$intersect(A_i) = \begin{cases} n & \text{if } \mathrm{minmax}(A_i) = \mathrm{maxmin}(A_i) \\ \big|\{x \mid x[i] > \mathrm{maxmin}(A_i) \wedge x[i] < \mathrm{minmax}(A_i)\}\big| & \text{otherwise} \end{cases}$$
For this measure, our formulation solves the problem of dealing with multiple classes, as well as with missing and hybrid data. In addition, it solves the F3 drawback of not penalizing attributes having the same (or very similar) values for all instances (Figure 6).
Using an efficient implementation, the computational complexity of our proposal is bounded by O ( m n ) .

3.2. Linearity Measures

Linearity measures are based on the idea of designing a hyperplane able to separate decision classes. Because we work with categorical and incomplete data, there is no direct way to apply the notion of a separating hyperplane to such data. In future work, we will explore other topological ideas that resemble Linearity and can deal with hybrid and incomplete data.

3.3. Neighborhood Measures

To extend Neighborhood measures, we propose using a dissimilarity function able to deal with mixed and incomplete data, such as HEOM [32]. We also propose using the normalized version (NHEOM) to guarantee the distance function to return values in [0, 1]. Let m a x i and m i n i be the maximum and minimum values of the numeric attribute A i . The NHEOM is as follows:
$$NHEOM(x, y) = \frac{\sqrt{\sum_{i=1}^{m} diss_i(x[i], y[i])^2}}{\sqrt{m}}$$
$$diss_i(x[i], y[i]) = \begin{cases} 1 & \text{if } x[i] =\ ? \vee y[i] =\ ? \\ overlap(x[i], y[i]) & \text{if } A_i \text{ is categorical} \\ rd\_diss(x[i], y[i]) & \text{if } A_i \text{ is numeric} \end{cases}$$
$$overlap(x[i], y[i]) = \begin{cases} 0 & \text{if } x[i] = y[i] \\ 1 & \text{otherwise} \end{cases} \qquad rd\_diss(x[i], y[i]) = \frac{|x[i] - y[i]|}{max_i - min_i}$$
As shown in Equation (21), the NHEOM function operates by attribute, and for each attribute, it uses one of three cases: for missing values, it returns one. For complete categorical values, the overlap function considers values similar only if they are equal, and for numerical values, the rd_diss function compares them by considering their difference with respect to the maximum difference between values. The normalization procedure (dividing by the square root of the number of attributes) guarantees NHEOM to be in [0, 1].
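A sketch of NHEOM under these definitions: "?" marks a missing value, and the per-attribute numeric ranges are assumed to be known in advance (here passed in explicitly as `num_range`).

```python
def nheom(x, y, is_cat, num_range):
    """Normalized HEOM in [0, 1]; num_range[i] = (min_i, max_i) for numeric i."""
    total = 0.0
    for i in range(len(x)):
        if x[i] == "?" or y[i] == "?":
            d = 1.0                               # missing: maximal dissimilarity
        elif is_cat[i]:
            d = 0.0 if x[i] == y[i] else 1.0      # overlap for categorical
        else:
            lo, hi = num_range[i]
            d = abs(x[i] - y[i]) / (hi - lo)      # rd_diss for numeric
        total += d * d
    return total ** 0.5 / len(x) ** 0.5  # dividing by sqrt(m) keeps it in [0, 1]
```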
We maintain the original formulation of N1, N2, and LSC measures, just by using a dissimilarity function able to deal with hybrid and incomplete data. Regarding the N3 measure, to avoid bias toward the majority class, we change the formulation by considering average by class error.
$$N3\_ext = \frac{1}{c} \sum_{i=1}^{c} \frac{\sum_{x \in U \mid class(x) = d_i} I\big(NN(x) \neq class(x)\big)}{\big|\{x \in U \mid class(x) = d_i\}\big|}$$
This formulation solves the bias by considering the errors for each decision class (Figure 7). It maintains the ability to handle multiclass, hybrid, and incomplete data.
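A sketch of N3_ext with leave-one-out nearest-neighbor prediction; the dissimilarity function is a parameter (e.g., NHEOM for hybrid and incomplete data), and the per-class averaging keeps the majority class from dominating.

```python
def n3_ext(instances, labels, diss):
    """Leave-one-out 1-NN error, averaged per class (macro average)."""
    n = len(instances)
    classes = sorted(set(labels))
    total = 0.0
    for c in classes:
        members = [i for i in range(n) if labels[i] == c]
        # count members whose nearest neighbor carries a different label
        wrong = sum(
            1 for i in members
            if labels[min((j for j in range(n) if j != i),
                          key=lambda j: diss(instances[i], instances[j]))] != c)
        total += wrong / len(members)
    return total / len(classes)
```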
Due to the difficulties of interpolation in hybrid and incomplete data, we chose not to extend the N4 measure. Similarly, the T1 measure was not considered because of the impossibility of computing hyperspheres with categorical data.

4. Discussion

For discussion, we first analyze the behavior of the proposed measures over synthetic data (Section 4.1), and we compute the measures over publicly available datasets (Section 4.2). All experiments were executed in a Lenovo ThinkPad X1 laptop, with Windows 10 operating system, Intel(R) Core(TM) i7-8550U CPU at 1.80 GHz and 16 GB of RAM. The laptop was not exclusively dedicated to the experiments (all were executed on low priority).

4.1. Synthetic Data

We first supply three explanatory examples for the computation of the proposed measures. The first two use two-dimensional datasets with no missing values (Figure 8 and Figure 9), so the data distribution can be visualized, and the third consists of a synthetic hybrid and incomplete dataset, adapted from the well-known play tennis dataset (Figure 10). The results of the data complexity measures over the example datasets are shown in Table 2.

4.2. Real Data

In this section, we compute the measures over publicly available multiclass hybrid and incomplete datasets. We add the proposed measures to the EPIC software [25,26]. Table 3 summarizes the measures used in the experiments, clarifying their type, proportion, boundaries, computational complexity, and whether or not it is a newly proposed method.
We selected 15 datasets publicly available in the KEEL repository [33]. All datasets correspond to real-life hybrid and incomplete problems (Table 4); all of them are partitioned using stratified five-fold cross-validation.
We also provide the non-error rate (NER) results for the Nearest Neighbor classifier for each dataset. NER is computed as [34]:
$$NER = \frac{1}{G}\sum_{g=1}^{G} Sn_g$$
where
$$Sn_g = \frac{c_{gg}}{n_g}$$
The NER measure assumes a confusion matrix of g classes (Figure 11), and it is robust for imbalanced data.
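NER can be computed directly from a confusion matrix; in this sketch we take rows as true classes and columns as predictions (that orientation is our assumption).

```python
def non_error_rate(confusion):
    """NER = mean of per-class recalls Sn_g = c_gg / n_g over a G x G matrix."""
    G = len(confusion)
    recalls = [row[g] / sum(row) for g, row in enumerate(confusion)]
    return sum(recalls) / G
```

Because every class contributes equally regardless of its size, NER stays informative under class imbalance, unlike plain accuracy.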
Table 5 presents the results of the data complexity measures’ computation for Feature-based and Neighborhood measures, and Table 6 shows the execution times (in milliseconds). The most complex dataset according to each measure is highlighted in bold.
As shown, the hardest dataset is marketing. This complexity is reflected in the low values of the non-error rate in Table 4. The F2_ext measure offers little information, being close to zero for 14 of the studied datasets. The F3_ext measure indicates that there is at least one attribute with low overlapping in 12 of the analyzed datasets.
The datasets with no clear separation are horse-colic, mammographic, and wisconsin. The N1, N2, and N3_ext measures correlate well with the results of the Nearest Neighbor classifier (Table 4), while LSC shows values near one for all datasets.
The Feature-based measures are very fast, in contrast with the Neighborhood-based measures, which depend on the computation of the dissimilarity matrix between instances. For marketing and mushroom datasets, Neighborhood measures took up to two minutes. However, it is important to mention that we used a sequential implementation of the dissimilarity computation, and such time can be significantly diminished with parallel computation.
In addition, for practical purposes, when we want to assess the complexity of a given dataset, we can compute the measures in less than three minutes for the biggest ones. We think this timeframe is suitable for real-world data, and the proposed measures are computationally feasible, even with a sequential implementation.
The limitations of the proposed measures are as follows:
  • The F1_ext measure, as its predecessor F1, for numerical data assumes that the linear boundary is unique and perpendicular to one of the feature axes. If a feature is separable but with more than one line, it does not capture such information.
  • F2_ext measure can become very small depending on the number of operands in Equation (17); that is, it is highly dependent on the number of features a dataset has.
  • Neighborhood measures are bounded by O ( m n 2 ) . For datasets with a huge number of instances, they can be computationally expensive.

5. Conclusions

Our hypothesis, that extending data complexity measures to the multiclass scenario with hybrid and incomplete data is possible, has been verified. We have introduced four data complexity measures for multiclass classification problems. All of the proposed measures are able to deal with hybrid (numeric and categorical) and missing data. This allows the complexity of a dataset to be known in advance, before using it to train a classifier. We included the proposed measures in the EPIC software [25,26], and we computed the measures for some of the publicly available datasets with satisfactory results.
In experiments with real datasets, it has been found that the Feature-based measures are very fast, in contrast with the Neighborhood-based measures, which depend on the computation of the dissimilarity matrix between instances. Specifically, when performing the experiments on the marketing and mushroom datasets, Neighborhood measures took up to two minutes, while only fractions of a second were invested in the others.
In future work, we want to work with topological ideas resembling Linearity with the intent to design new Linearity-based measures to deal with hybrid and incomplete data. In addition, we want to solve the issue of the F2_ext measure being severely affected by the curse of dimensionality.
Also, in the case of Neighborhood-based measures, we will implement them using parallel computation to reduce the time required to compute the dissimilarity matrix between instances.
A very relevant future work consists of carrying out a deeper analysis of all the complexity measures available in the current state-of-the-art. Then we will apply formal mathematical methods to specify, build, and verify software and hardware systems focused on their application to machine learning solutions. To do this, we will rely heavily on research papers that clearly explain the phases of machine learning and the available formal methods to verify each phase [30].

Author Contributions

Conceptualization, F.J.C.-U. and Y.V.-R.; methodology, Y.V.-R.; software, Y.V.-R.; formal analysis, C.Y.-M. and M.L.; investigation, F.J.C.-U.; data curation, F.J.C.-U.; writing—original draft preparation, F.J.C.-U. and Y.V.-R.; writing—review and editing, C.Y.-M. and M.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

All used real datasets are available at https://www.mdpi.com/ethics (accessed on 1 December 2022).

Acknowledgments

The authors would like to thank the Instituto Politécnico Nacional (Secretaría Académica, Comisión de Operación y Fomento de Actividades Académicas, Secretaría de Investigación y Posgrado, Centro de Investigación en Computación, and Centro de Innovación y Desarrollo Tecnológico en Cómputo), the Consejo Nacional de Ciencia y Tecnología, and Sistema Nacional de Investigadores for their economic support to developing this work.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Shetty, S.H.; Shetty, S.; Singh, C.; Rao, A. Supervised Machine Learning: Algorithms and Applications. In Fundamentals and Methods of Machine and Deep Learning: Algorithms, Tools and Applications; Singh, P., Ed.; Wiley: Hoboken, NJ, USA, 2022; pp. 1–16.
  2. Satinet, C.; Fouss, F. A Supervised Machine Learning Classification Framework for Clothing Products’ Sustainability. Sustainability 2022, 14, 1334.
  3. Eastvedt, D.; Naterer, G.; Duan, X. Detection of faults in subsea pipelines by flow monitoring with regression supervised machine learning. Process Saf. Environ. Prot. 2022, 161, 409–420.
  4. Liu, X.; Lu, D.; Zhang, A.; Liu, Q.; Jiang, G. Data-Driven Machine Learning in Environmental Pollution: Gains and Problems. Environ. Sci. Technol. 2022, 56, 2124–2133.
  5. Voulgari, I.; Stouraitis, E.; Camilleri, V.; Karpouzis, K. Artificial Intelligence and Machine Learning Education and Literacy: Teacher Training for Primary and Secondary Education Teachers. In Handbook of Research on Integrating ICTs in STEAM Education; IGI Global: Hershey, PA, USA, 2022; pp. 1–21.
  6. Aksoğan, M.; Atici, B. Machine learning applications in education: A literature review. In Education & Science 2022; EFE Academy: Jaipur, India, 2022; p. 27.
  7. Rezapour, M.; Hansen, L. A machine learning analysis of COVID-19 mental health data. Sci. Rep. 2022, 12, 14965.
  8. Aitzaouiat, C.E.; Latif, A.; Benslimane, A.; Chin, H.-H. Machine Learning Based Prediction and Modeling in Healthcare Secured Internet of Things. Mob. Netw. Appl. 2022, 27, 84–95.
  9. Alanazi, A. Using machine learning for healthcare challenges and opportunities. Inform. Med. Unlocked 2022, 30, 100924.
  10. Hu, Q.; Yu, D.; Xie, Z. Neighborhood classifiers. Expert Syst. Appl. 2008, 34, 866–876.
  11. Kotsiantis, S.B. Decision trees: A recent overview. Artif. Intell. Rev. 2013, 39, 261–283.
  12. Abiodun, O.I.; Jantan, A.; Omolara, A.E.; Dada, K.V.; Umar, A.M.; Linus, O.U.; Arshad, H.; Kazaure, A.A.; Gana, U.; Kiru, M.U. Comprehensive review of artificial neural network applications to pattern recognition. IEEE Access 2019, 7, 158820–158846.
  13. Cervantes, J.; Garcia-Lamont, F.; Rodríguez-Mazahua, L.; Lopez, A. A comprehensive survey on support vector machine classification: Applications, challenges and trends. Neurocomputing 2020, 408, 189–215.
  14. Yáñez-Márquez, C.; López-Yáñez, I.; Aldape-Pérez, M.; Camacho-Nieto, O.; Argüelles-Cruz, A.J.; Villuendas-Rey, Y. Theoretical foundations for the alpha-beta associative memories: 10 years of derived extensions, models, and applications. Neural Process. Lett. 2018, 48, 811–847.
  15. Martínez-Trinidad, J.F.; Guzmán-Arenas, A. The logical combinatorial approach to pattern recognition, an overview through selected works. Pattern Recognit. 2001, 34, 741–751.
  16. Wolpert, D.H. The supervised learning no-free-lunch theorems. In Soft Computing and Industry; Springer: London, UK, 2002; pp. 25–42.
  17. Luengo, J.; Herrera, F. An automatic extraction method of the domains of competence for learning classifiers using data complexity measures. Knowl. Inf. Syst. 2015, 42, 147–180.
  18. Ma, Y.; Zhao, S.; Wang, W.; Li, Y.; King, I. Multimodality in meta-learning: A comprehensive survey. Knowl.-Based Syst. 2022, 250, 108976.
  19. Huisman, M.; Van Rijn, J.N.; Plaat, A. A survey of deep meta-learning. Artif. Intell. Rev. 2021, 54, 4483–4541.
  20. Camacho-Urriolagoitia, F.J.; Villuendas-Rey, Y.; López-Yáñez, I.; Camacho-Nieto, O.; Yáñez-Márquez, C. Correlation Assessment of the Performance of Associative Classifiers on Credit Datasets Based on Data Complexity Measures. Mathematics 2022, 10, 1460.
  21. Cano, J.-R. Analysis of data complexity measures for classification. Expert Syst. Appl. 2013, 40, 4820–4831.
  22. Barella, V.H.; Garcia, L.P.; de Souto, M.C.; Lorena, A.C.; de Carvalho, A.C. Assessing the data complexity of imbalanced datasets. Inf. Sci. 2021, 553, 83–109.
  23. Ho, T.K.; Basu, M. Complexity measures of supervised classification problems. IEEE Trans. Pattern Anal. Mach. Intell. 2002, 24, 289–300.
  24. Bello, M.; Nápoles, G.; Vanhoof, K.; Bello, R. Data quality measures based on granular computing for multi-label classification. Inf. Sci. 2021, 560, 51–67.
  25. Hernández-Castaño, J.A.; Villuendas-Rey, Y.; Camacho-Nieto, O.; Yáñez-Márquez, C. Experimental platform for intelligent computing (EPIC). Comput. y Sist. 2018, 22, 245–253.
  26. Hernández-Castaño, J.A.; Villuendas-Rey, Y.; Nieto, O.C.; Rey-Benguría, C.F. A New Experimentation Module for the EPIC Software. Res. Comput. Sci. 2018, 147, 243–252.
  27. Lorena, A.C.; Garcia, L.P.; Lehmann, J.; Souto, M.C.; Ho, T.K. How complex is your classification problem? A survey on measuring classification complexity. ACM Comput. Surv. 2019, 52, 1–34.
  28. Cummins, L. Combining and Choosing Case Base Maintenance Algorithms; University College Cork: Cork, Ireland, 2013.
  29. Seshia, S.A.; Sadigh, D.; Sastry, S.S. Toward verified artificial intelligence. Commun. ACM 2022, 65, 46–55.
  30. Krichen, M.; Mihoub, A.; Alzahrani, M.Y.; Adoni, W.Y.H.; Nahhal, T. Are Formal Methods Applicable to Machine Learning and Artificial Intelligence? In Proceedings of the 2022 2nd International Conference of Smart Systems and Emerging Technologies (SMARTTECH), Riyadh, Saudi Arabia, 9–11 May 2022; pp. 48–53.
  31. Cios, K.J.; Swiniarski, R.W.; Pedrycz, W.; Kurgan, L.A. The knowledge discovery process. In Data Mining; Springer: Boston, MA, USA, 2007; pp. 9–24.
  32. Wilson, D.R.; Martinez, T.R. Improved heterogeneous distance functions. J. Artif. Intell. Res. 1997, 6, 1–34.
  33. Alcalá-Fdez, J.; Fernández, A.; Luengo, J.; Derrac, J.; García, S.; Sánchez, L.; Herrera, F. KEEL Data-Mining Software Tool: Data Set Repository, Integration of Algorithms and Experimental Analysis Framework. J. Mult.-Valued Log. Soft Comput. 2011, 17, 255–287.
  34. Ballabio, D.; Grisoni, F.; Todeschini, R. Multivariate comparison of classification performance measures. Chemom. Intell. Lab. Syst. 2018, 174, 33–44.
Figure 1. Different scenarios in supervised classification while considering a decision label.
Figure 2. Placement of data complexity assessment in the CRISP-DM-KD cycle as in [31].
Figure 3. Proposed approach for developing novel data complexity measures.
Figure 4. Example of instances linearly separable (a) by one line and (b) by three lines. Note the values of F1_ext are 0.14 and 0.24, respectively.
Figure 5. Example of the curse of dimensionality for the F2 measure. (a) Dataset with two overlapping instances. Assume all attributes have the same values. (b) Results for one attribute. (c) Results for two attributes. (d) Results for ten attributes. Note that the F2_ext values rapidly decrease as the number of attributes increases, even though there are only two overlapping instances in all cases.
Figure 6. The drawback of F3 is solved by F3_ext. We use a two-dimensional dataset with two overlapping instances, in which all instances have a value of zero in attribute A2. Note the values of F3 and F3_ext are 0.00 and 0.50, respectively.
Figure 7. Solving the N3 bias towards the majority class with the new formulation. (a) Balanced dataset with ten instances, two of them misclassified. (b) Imbalanced dataset with ten instances, two of them misclassified. Note how the proposed measure accounts for the fact that an entire class is misclassified in (b).
Figure 8. Iris2D dataset. Note that classes are compact, balanced, and fully separated.
Figure 9. Clover dataset. Classes are separated but imbalanced and with disjoints. The dataset resembles a clover flower.
Figure 10. Tennis dataset. This is a modified version of the well-known tennis dataset by Quinlan. Note that it has hybrid numeric and categorical data with missing values.
Figure 11. Confusion matrix of c classes.
Table 1. Description of the reviewed data complexity measures.
| Measure | Proportion | Boundaries | Complexity [27] | Multiclass | Hybrid | Missing Values |
|---|---|---|---|---|---|---|
| F1 | Direct | (0, 1] | O(mn) | Yes * | No | No |
| F2 | Direct | (0, 1] | O(mnc) | No | No | No |
| F3 | Direct | [0, 1] | O(mnc) | No | No | No |
| L1 | Direct | [0, 1) | O(n^2) | No | No | No |
| L2 | Direct | [0, 1] | O(n^2) | No | No | No |
| L3 | Direct | [0, 1] | O(n^2 + mlc) | No | No | No |
| N1 | Direct | [0, 1] | O(mn^2) | Yes | Yes + | Yes + |
| N2 | Direct | [0, 1) | O(mn^2) | Yes | Yes + | Yes + |
| N3 | Direct | [0, 1] | O(mn) | Yes | No | No |
| N4 | Direct | [0, 1] | O(mnl) | Yes | No | No |
| T1 | Direct | [0, 1] | O(mn^2) | Yes | No | No |
| LSC | Direct | [0, 1 − 1/n] | O(mn^2) | Yes | Yes + | Yes + |

* Some formulations admit multiclass, as in [27]. + Depending on the dissimilarity function used.
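As Table 1 notes, whether a neighborhood measure handles hybrid or incomplete data depends on the dissimilarity function used. The sketch below illustrates the underlying idea with a leave-one-out 1-NN error rate (the idea behind N3) over a HEOM-style dissimilarity in the spirit of [32]. It is an illustrative reimplementation under our own conventions (function names are ours, and `None` encodes a missing value), not the EPIC implementation used in the paper.

```python
import math

def heom(x, y, is_categorical, ranges):
    """HEOM-style dissimilarity: overlap for categorical attributes,
    range-normalized difference for numeric ones, and maximal
    difference (1.0) when either value is missing (None)."""
    total = 0.0
    for xi, yi, cat, rng in zip(x, y, is_categorical, ranges):
        if xi is None or yi is None:
            d = 1.0
        elif cat:
            d = 0.0 if xi == yi else 1.0
        else:
            d = abs(xi - yi) / rng if rng > 0 else 0.0
        total += d * d
    return math.sqrt(total)

def n3_loo_error(X, y, is_categorical):
    """Leave-one-out 1-NN error rate over a mixed, possibly
    incomplete dataset (rows of X), with class labels y."""
    # Attribute ranges computed over observed (non-missing) numeric values.
    ranges = []
    for j, cat in enumerate(is_categorical):
        if cat:
            ranges.append(1.0)
        else:
            vals = [row[j] for row in X if row[j] is not None]
            ranges.append(max(vals) - min(vals) if vals else 0.0)
    errors = 0
    for i, xi in enumerate(X):
        best, best_d = None, float("inf")
        for k, xk in enumerate(X):
            if k == i:
                continue  # leave the query instance out
            d = heom(xi, xk, is_categorical, ranges)
            if d < best_d:
                best_d, best = d, k
        errors += y[best] != y[i]
    return errors / len(X)
```

Because the dissimilarity natively handles categorical attributes and missing values, no imputation or encoding preprocessing is needed, which is the property the extended measures exploit. The quadratic loop matches the O(mn^2) complexity reported in Table 1.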
Table 2. Results of the data complexity measures for synthetic datasets.
| Measure | Iris2D | Clover | Tennis |
|---|---|---|---|
| F1_ext | 0.059 | 0.980 | 0.667 |
| F2_ext | 0.000 | 0.374 | 0.007 |
| F3_ext | 0.000 | 0.624 | 0.200 |
| N1 | 0.020 | 0.054 | 0.286 |
| N2 | 0.023 | 0.261 | 0.555 |
| N3_ext | 0.000 | 0.102 | 0.267 |
| LSC | 0.504 | 0.938 | 0.893 |
Table 3. Description of the proposed data complexity measures and each measure’s ability to deal with hybrid and incomplete data.
| Type | Measure | Proportion | Boundaries | Complexity | New |
|---|---|---|---|---|---|
| Feature-based | F1_ext | Direct | [0, 1] | O(mn) | Yes |
| Feature-based | F2_ext | Direct | [0, 1] | O(mn) | Yes |
| Feature-based | F3_ext | Direct | [0, 1] | O(mn) | Yes |
| Neighborhood-based | N1 | Direct | [0, 1] | O(mn^2) | No |
| Neighborhood-based | N2 | Direct | [0, 1] | O(mn^2) | No |
| Neighborhood-based | N3_ext | Direct | [0, 1] | O(mn^2) | Yes |
| Neighborhood-based | LSC | Direct | [0, 1 − 1/n] | O(mn^2) | No |
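For contrast with the extended F3_ext of Table 3, the classical feature-based F3 (maximum individual feature efficiency) for a binary, numeric, complete dataset can be sketched as follows. This reflects the formulation surveyed in [27], not the extension proposed in the paper, and the function name and data layout are our own.

```python
def f3(X, y):
    """Classical F3 for a binary, numeric, complete dataset:
    for each feature, count the instances falling inside the
    overlap interval of the two per-class value ranges; F3 is
    the smallest such fraction (direct proportion: higher
    values indicate a more complex problem)."""
    classes = sorted(set(y))
    assert len(classes) == 2, "classical F3 is defined for two classes"
    n, m = len(X), len(X[0])
    best = 1.0
    for j in range(m):
        a = [X[i][j] for i in range(n) if y[i] == classes[0]]
        b = [X[i][j] for i in range(n) if y[i] == classes[1]]
        # Overlap interval of the two class ranges (empty if lo > hi).
        lo, hi = max(min(a), min(b)), min(max(a), max(b))
        overlap = sum(1 for i in range(n) if lo <= X[i][j] <= hi)
        best = min(best, overlap / n)
    return best
```

For a dataset whose classes are fully separable along at least one feature, the overlap interval is empty and F3 is 0, matching the F3_ext value of 0.000 reported for Iris2D in Table 2.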
Table 4. Description of the real datasets used.
| Name | Numeric Attributes | Categorical Attributes | Instances | Classes | % Missing | Non-Error Rate |
|---|---|---|---|---|---|---|
| automobile | 15 | 10 | 205 | 6 | 26.83 | 0.44 |
| bands | 19 | 0 | 539 | 2 | 32.28 | 0.63 |
| breast | 0 | 9 | 286 | 2 | 3.15 | 0.66 |
| cleveland | 13 | 0 | 303 | 5 | 1.98 | 0.36 |
| crx | 6 | 9 | 690 | 2 | 5.36 | 0.58 |
| dermatology | 34 | 0 | 366 | 6 | 2.19 | 0.96 |
| hepatitis | 19 | 0 | 155 | 2 | 48.39 | 0.60 |
| horse-colic | 8 | 15 | 368 | 2 | 98.10 | 0.53 |
| housevotes | 0 | 16 | 435 | 2 | 46.67 | 0.93 |
| mammographic | 5 | 0 | 961 | 2 | 13.63 | 0.75 |
| marketing | 13 | 0 | 8993 | 9 | 23.54 | 0.23 |
| mushroom | 0 | 22 | 8124 | 2 | 30.53 | 1.00 |
| post-operative | 0 | 8 | 90 | 3 | 3.33 | 0.37 |
| saheart | 8 | 1 | 462 | 2 | 0.00 | 0.51 |
| wisconsin | 9 | 0 | 699 | 2 | 2.29 | 0.93 |
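The Non-Error Rate column of Table 4 follows the performance measures compared in [34], where NER is commonly computed as the average of the per-class recalls obtained from a confusion matrix of c classes (Figure 11). A minimal sketch, with a function name of our own choosing:

```python
def non_error_rate(cm):
    """Non-Error Rate: mean of per-class recalls of a
    c x c confusion matrix (rows = true classes, columns =
    predicted classes). Classes with no true instances are
    skipped so the average stays well defined."""
    recalls = []
    for i, row in enumerate(cm):
        total = sum(row)  # number of true instances of class i
        if total > 0:
            recalls.append(row[i] / total)
    return sum(recalls) / len(recalls)
```

Because every class contributes equally regardless of its size, NER is less biased towards the majority class than plain accuracy, which is why it suits the imbalanced datasets listed in Table 4.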
Table 5. Results for Feature-based and Neighborhood and Dimensionality measures.
| Name | F1_ext | F2_ext | F3_ext | N1 | N2 | N3_ext | LSC |
|---|---|---|---|---|---|---|---|
| automobile | 0.587 | 0.000 | 0.000 | 0.546 | 0.528 | 0.520 | 0.991 |
| bands | 0.757 | 0.000 | 0.009 | 0.415 | 0.450 | 0.368 | 0.993 |
| breast | 0.490 | 0.000 | 0.009 | 0.311 | 0.464 | 0.292 | 0.991 |
| cleveland | 0.735 | 0.000 | 0.000 | 0.588 | 0.546 | 0.601 | 0.993 |
| crx | 0.525 | 0.000 | 0.004 | 0.392 | 0.396 | 0.394 | 0.994 |
| dermatology | 0.056 | 0.000 | 0.000 | 0.054 | 0.343 | 0.040 | 0.874 |
| hepatitis | 0.560 | 0.000 | 0.000 | 0.277 | 0.364 | 0.284 | 0.956 |
| horse-colic | 0.178 | 0.000 | 0.370 | 0.419 | 0.267 | 0.393 | 0.993 |
| housevotes | 0.889 | 0.000 | 0.011 | 0.088 | 0.341 | 0.072 | 0.893 |
| mammographic | 0.911 | 0.001 | 0.670 | 0.252 | 0.345 | 0.254 | 0.993 |
| marketing | 0.804 | 0.644 | 0.000 | 0.728 | 0.706 | 0.724 | 1.000 |
| mushroom | 0.733 | 0.000 | 0.000 | 0.002 | 0.269 | 0.000 | 0.876 |
| post-operative | 0.727 | 0.000 | 0.029 | 0.408 | 0.489 | 0.453 | 0.987 |
| saheart | 0.818 | 0.063 | 0.005 | 0.406 | 0.475 | 0.424 | 0.988 |
| wisconsin | 0.325 | 0.161 | 0.141 | 0.065 | 0.280 | 0.061 | 0.747 |
Table 6. Execution time (in milliseconds) to compute the data complexity measures.
| Name | F1_ext | F2_ext | F3_ext | N1 | N2 | N3_ext | LSC |
|---|---|---|---|---|---|---|---|
| automobile | 2.400 | 3.600 | 2.800 | 1056.000 | 1055.400 | 949.400 | 1120.000 |
| bands | 2.200 | 0.000 | 10.200 | 4035.200 | 4407.000 | 4164.600 | 3746.800 |
| breast | 0.000 | 0.000 | 0.000 | 161.200 | 204.000 | 188.000 | 303.200 |
| cleveland | 1.200 | 0.000 | 2.000 | 1297.400 | 1374.600 | 1458.200 | 1402.200 |
| crx | 1.400 | 0.000 | 2.400 | 2836.600 | 3269.600 | 3329.400 | 3130.400 |
| dermatology | 1.200 | 0.000 | 4.600 | 1490.200 | 1491.600 | 1528.400 | 1467.600 |
| hepatitis | 0.600 | 0.000 | 5.200 | 609.400 | 526.200 | 539.400 | 631.400 |
| horse-colic | 0.800 | 0.200 | 1.600 | 1276.800 | 1249.600 | 1316.000 | 1555.400 |
| housevotes | 0.800 | 0.000 | 0.800 | 849.600 | 879.400 | 879.000 | 808.600 |
| mammographic | 1.000 | 0.000 | 2.000 | 3524.200 | 3600.600 | 3611.400 | 3371.200 |
| marketing | 35.200 | 5.600 | 67.600 | 664,776.000 | 726,545.800 | 729,344.000 | 646,769.200 |
| mushroom | 9.800 | 4.800 | 9.200 | 123,978.800 | 129,301.400 | 122,140.800 | 122,648.000 |
| post-operative | 0.000 | 0.000 | 0.000 | 27.800 | 20.400 | 27.400 | 24.800 |
| saheart | 0.600 | 0.000 | 0.800 | 547.000 | 531.400 | 550.000 | 542.600 |
| wisconsin | 1.400 | 0.000 | 2.800 | 3440.400 | 3394.000 | 3137.600 | 2905.400 |

