  • Article
  • Open Access

10 December 2022

Fast Training Set Size Reduction Using Simple Space Partitioning Algorithms

1 Department of Information and Electronic Engineering, School of Engineering, International Hellenic University, 57400 Thessaloniki, Greece
2 Department of Applied Informatics, School of Information Sciences, University of Macedonia, 156 Egnatia Street, 54636 Thessaloniki, Greece
3 Department of Digital Systems, University of the Peloponnese, Valioti’s Building, 23100 Sparta, Greece
* Author to whom correspondence should be addressed.
This article belongs to the Special Issue Computing and Embedded Artificial Intelligence

Abstract

The Reduction by Space Partitioning (RSP3) algorithm is a well-known data reduction technique. It summarizes the training data and generates representative prototypes. Its goal is to reduce the computational cost of an instance-based classifier without penalty in accuracy. The algorithm keeps on dividing the initial training data into subsets until all of them become homogeneous, i.e., they contain instances of the same class. To divide a non-homogeneous subset, the algorithm computes its two furthest instances and assigns all instances to their closest furthest instance. This is a very expensive computational task, since all distances among the instances of a non-homogeneous subset must be calculated. Moreover, noise in the training data leads to a large number of small homogeneous subsets, many of which have only one instance. These instances are probably noise, but the algorithm mistakenly generates prototypes for these subsets. This paper proposes simple and fast variations of RSP3 that avoid the computationally costly partitioning tasks and remove the noisy training instances. The experimental study conducted on sixteen datasets and the corresponding statistical tests show that the proposed variations of the algorithm are much faster and achieve higher reduction rates than the conventional RSP3 without negatively affecting the accuracy.

1. Introduction

Data reduction is a crucial pre-processing task [1] in instance-based classification [2]. Its goal is to reduce the high computational cost involved in such classifiers by reducing the training data as much as possible without penalty in classification accuracy. In effect, Data Reduction Techniques (DRTs) attempt to either select or generate a small set of training prototypes that represent the initial large training set so that the computational cost of the classifier is vastly reduced. The selected or generated set of training prototypes is called a condensing set.
DRTs can be based on either the concept of Prototype Selection (PS) [3] or the concept of Prototype Generation (PG) [4]. A PS algorithm collects representative prototypes from the initial training set, while a PG algorithm summarizes similar training instances and generates a prototype that represents them. PS and PG algorithms are based on the hypothesis that training instances far from the boundaries of the different classes, also called class decision boundaries, can be removed without penalty in classification accuracy. On the other hand, the training instances that are close to class decision boundaries are the only useful training instances in instance-based classification. In this paper, we focus on the PG algorithms.
The RSP3 algorithm [5] is a well-known parameter-free PG algorithm. Its condensing set leads to accurate and fast classifiers. However, the algorithm requires high computational cost to generate its condensing set because it is based on a recursive partitioning process that divides the training set into subsets that contain training instances of only one class, i.e., they are homogeneous. The algorithm keeps dividing each non-homogeneous subset into two new subsets and stops when all created subsets become homogeneous. The center/mean of each subset constitutes a representative prototype that replaces all instances of the subset. In each algorithm step, a subset is divided by finding the pair of its furthest instances. The instances of the initial subset are distributed to the two subsets according to their distances from those furthest instances. The pair of the furthest instances is retrieved by computing all the distances between instances and finding the pair of instances with the maximum distance. The computational cost of this task is high and may become even prohibitive in cases of very large datasets. This weak point constitutes the first motive of the present paper, namely Motive-A.
The quality and the size of the condensing set created by RSP3 depends on the degree of noise in the training data [6]. Suppose that a training instance x that belongs to class A lies in the middle of a data neighborhood with instances that belong to class B. In this case, x constitutes noise. RSP3 splits the neighborhood into multiple subsets, with one of them containing only instance x. The algorithm mistakenly considers x as a prototype and places it in the condensing set. This observation constitutes the second motive of the present work, namely Motive-B.
In this paper, we propose simple RSP3 variations which consider the two motives presented above. More specifically, this paper proposes:
  • RSP3 variations that replace the costly task of retrieving the pair of the furthest instances with simpler and faster procedures for dividing each subset.
  • A mechanism for noise removal. This mechanism considers each subset containing only one instance as noise and does not generate a prototype for that subset. As a result, it improves the reduction rates and the classification accuracy when applied to noisy training sets. The proposed mechanism can be incorporated into any of the RSP3 variations (conventional RSP3 included).
The experiments show that the proposed variations are much faster than the original version of RSP3. In most cases, accuracy remains high, and the variations that incorporate the mechanism for noise removal improve the reduction rates and the classification accuracy, especially on noisy datasets. The experimental results are statistically validated by utilizing the Wilcoxon signed rank test and the Friedman test.
The rest of the paper is organized as follows: The recent research works in the field of PG algorithms are reviewed in Section 2. The original RSP3 algorithm is presented in Section 3. The new RSP3 variations are presented in detail in Section 4. Section 5 presents and discusses the experimental results and the results of the statistical tests. Section 6 concludes the paper and outlines future work.

3. The Original RSP3 Algorithm

RSP3 is one of the three proposed RSP algorithms [5]. The three algorithms are descendants of the Chen and Jozwik algorithm (CJA) [31]. However, RSP3 is the only parameter-free algorithm among the RSP family and CJA, and it builds the same condensing set regardless of the order of the data in the training set.
RSP3 works as follows: It initially finds the pair of the furthest instances, a and b, in the training set (see Figure 1). Then, it splits the training set into two subsets, Ca and Cb, with each training instance assigned to its closest furthest instance. In each subsequent iteration, a non-homogeneous subset is divided into two subsets by following the same procedure. The splitting stops when all created subsets become homogeneous. Then, the algorithm generates the prototypes: for each created subset C, RSP3 computes its mean by averaging its training instances. The mean instance, labeled with the class of the instances in C, plays the role of a generated prototype and is placed in the condensing set.
Figure 1. The RSP3 algorithm divides the subset according to its furthest instances f_inst1 and f_inst2.
The pseudo-code presented in Algorithm 1 is a possible non-recursive implementation of RSP3 that uses a data structure S to store subsets. In the beginning, the whole training set (TS) is a subset to be processed, and it is placed in S (line 2). At each iteration, RSP3 selects a subset C from S and checks whether it is homogeneous. If C is homogeneous, the algorithm computes its mean instance and stores it in the condensing set (CS) as a prototype (lines 6–9). Then, C is removed from S (line 17). If C is non-homogeneous, the algorithm finds the furthest instances a and b in C (line 11) and divides C into two subsets Ca and Cb by assigning each instance of C to its closest furthest instance (lines 12–13). The new subsets Ca and Cb are added to S (lines 14–15), and C is removed from S (line 17). The loop terminates when S becomes empty (line 18), i.e., when all subsets have become homogeneous.
Algorithm 1 RSP3
Input: TS {Training Set}
Output: CS {Condensing Set}
 1: S ← initialize structure that holds unprocessed subsets
 2: add(S, TS)
 3: CS ← initialize CS
 4: repeat
 5:   C ← pick a subset from S
 6:   if C is homogeneous then
 7:     r ← mean instance of C
 8:     r.label ← class of instances in C
 9:     CS ← CS ∪ {r}   {add r to the condensing set}
10:   else
11:     (a, b) ← furthest instances in C   {Algorithm 2 is applied}
12:     Ca ← set of C instances closer to a
13:     Cb ← set of C instances closer to b
14:     add(S, Ca)
15:     add(S, Cb)
16:   end if
17:   remove(S, C)
18: until IsEmpty(S)   {all subsets became homogeneous}
19: return CS
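To make the walkthrough concrete, the following is a minimal, runnable Python sketch of Algorithm 1. The paper's implementation is in C++ and is not shown; all names below are our own, and squared Euclidean distance is used because only distance comparisons matter.

```python
def is_homogeneous(C):
    # A subset is homogeneous when all its instances share one class label
    return len({label for _, label in C}) == 1

def sq_dist(p, q):
    # Squared Euclidean distance (same ordering as the true distance)
    return sum((a - b) ** 2 for a, b in zip(p, q))

def farthest_pair(C):
    # The brute-force search of Algorithm 2: all pairwise distances
    best, pair = -1.0, None
    for i in range(len(C)):
        for j in range(i + 1, len(C)):
            d = sq_dist(C[i][0], C[j][0])
            if d > best:
                best, pair = d, (C[i][0], C[j][0])
    return pair

def rsp3(training_set):
    # training_set: list of (feature_tuple, class_label) pairs
    S = [list(training_set)]   # unprocessed subsets (line 2 of Algorithm 1)
    CS = []                    # the condensing set under construction
    while S:
        C = S.pop()
        if is_homogeneous(C):
            dim = len(C[0][0])
            mean = tuple(sum(x[k] for x, _ in C) / len(C) for k in range(dim))
            CS.append((mean, C[0][1]))   # prototype labeled by the subset's class
        else:
            a, b = farthest_pair(C)
            Ca, Cb = [], []
            for inst in C:   # assign each instance to its closest furthest instance
                (Ca if sq_dist(inst[0], a) <= sq_dist(inst[0], b) else Cb).append(inst)
            S.extend([Ca, Cb])
    return CS
```

On a toy two-class set with well-separated clusters, the first split already yields two homogeneous subsets, so the sketch returns exactly one prototype per class.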
In the areas close to the class decision boundaries, training instances from different classes lie close to each other. RSP3 creates more prototypes for those data areas, since many small homogeneous subsets are created there. Similarly, more subsets are created and more prototypes are generated for noisy data areas; in effect, a subset with only one instance constitutes noise. In contrast, fewer and larger subsets are created for the “internal” data areas, which are far from the decision boundaries and where a single class dominates.
Sánchez, in the experimental study presented in [5], showed that RSP3 generates a small condensing set. When an instance-based classifier such as k-NN utilizes the RSP3-generated condensing set, it achieves accuracy almost as high as when k-NN runs over the original training set. However, the computational cost of the classification step is significantly lower.
The retrieval of the pair of the furthest instances in each subset requires the computation of all distances between the instances of the subset. This approach is simple and straightforward. However, it is a computationally expensive task that burdens the overall pre-processing cost of the algorithm. In cases of large datasets, this drawback may render the execution of RSP3 prohibitive.
In this respect, the conventional RSP3 algorithm computes |C| × (|C| − 1) / 2 distances in order to find the most distant instances in each subset C. Thus, for each subset division, RSP3 proceeds with the pseudo-code outlined in Algorithm 2.
Algorithm 2 The Grid algorithm
Input: C {A subset containing instances inst_1 through inst_|C|}
Output: D_max, f_inst1, f_inst2
 1: D_max ← 0
 2: for i ← 1 to |C| do
 3:   for j ← i + 1 to |C| do
 4:     D_curr ← distance(inst_i, inst_j)
 5:     if D_curr > D_max then
 6:       D_max ← D_curr
 7:       f_inst1 ← inst_i
 8:       f_inst2 ← inst_j
 9:     end if
10:   end for
11: end for
12: return D_max, f_inst1, f_inst2
In effect, a grid of distances is computed; hence, Algorithm 2 is labeled “The Grid Algorithm”. It returns the furthest instances f_inst1 and f_inst2 in C along with their distance D_max. Hereinafter, each reference to the “RSP3” acronym implies the RSP3 algorithm whereby the most distant instances are calculated by applying the Grid algorithm to all the instances in each subset. It is worth mentioning that RSP3 as implemented in the KEEL software [32] applies this simple and straightforward approach for finding the pair of the most distant instances in C.
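To get a feel for why this cost can become prohibitive, the |C| × (|C| − 1) / 2 count can be evaluated directly. A subset the size of the cleaned KDD dataset used later (141,481 instances) already implies roughly ten billion distance computations for a single split; the figures below are illustrative, not taken from the paper's measurements.

```python
def grid_distance_count(n):
    # Pairwise distances the Grid algorithm computes for a subset of size n
    return n * (n - 1) // 2

# The quadratic growth is what makes the first split of a large training set
# so expensive: RSP3's first call to Algorithm 2 spans the whole training set.
for n in (100, 10_000, 141_481):
    print(n, grid_distance_count(n))
```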

4. The Proposed RSP3 Variations

4.1. The RSP3 with Editing (RSP3E) Algorithm

The RSP3 with editing (RSP3E) algorithm incorporates an editing mechanism that removes noise from the training data. RSP3E is almost identical to the conventional RSP3, with one major difference: if a subset with only one instance is created, this subset is considered to be noise. In effect, such an instance is surrounded by instances that belong to different classes. Therefore, for each subset containing only one instance, RSP3E does not generate a prototype for the condensing set. RSP3E addresses Motive-B (defined in Section 1) and was inspired by the idea first adopted by ERHC [13] and EHC [33].
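The editing rule amounts to a one-line guard in the prototype generation step of Algorithm 1. A sketch in Python follows; the function and parameter names are illustrative, not taken from the paper's C++ code.

```python
def generate_prototype(C, condensing_set, editing=True):
    # RSP3E's only change to prototype generation: a homogeneous subset with a
    # single instance is treated as noise and produces no prototype.
    if editing and len(C) == 1:
        return   # discard: the lone instance is surrounded by other classes
    dim = len(C[0][0])
    mean = tuple(sum(x[k] for x, _ in C) / len(C) for k in range(dim))
    condensing_set.append((mean, C[0][1]))
```

With editing disabled, the function reproduces the conventional RSP3 behavior, which is why the mechanism can be bolted onto any of the variations.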

4.2. The RSP3-RND and RSP3E-RND Algorithms

As already explained, RSP3 divides each subset by finding the pair of its furthest instances. These most distant instances are likely to belong to different classes. By splitting the subset using such instances, the probability of creating two large homogeneous subsets is higher. Thus, RSP3 may need fewer iterations to divide the whole training set into homogeneous subsets, and the reduction rates may be higher.
The RSP3-RND and RSP3E-RND algorithms were inspired by the following observation: RSP3 can run and produce a condensing set even if it selects any pair of instances instead of the pair of the furthest instances. In that case, the algorithm will likely need more subset divisions and the data reduction rate will be lower. However, the procedure of subset division will be much faster, since the costly retrieval of the furthest instances will be avoided. This simple idea is adopted by RSP3-RND and RSP3E-RND that work similarly to RSP3 and RSP3E, respectively, but they randomly select the pair of instances used for subset division.
RSP3-RND and RSP3E-RND will generate different condensing sets in different executions. In other words, the number of divisions and the generated prototypes depend on the selection of the random pairs of instances. RSP3-RND addresses Motive-A, while RSP3E-RND addresses both motives.
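The replacement for the farthest-pair search is a constant-time random draw. A minimal Python sketch follows; the rng parameter is our own addition, useful for making the otherwise non-deterministic runs reproducible.

```python
import random

def random_pair(C, rng=random):
    # RSP3-RND: draw two distinct instances in O(1) instead of running the
    # O(|C|^2) farthest-pair search of the Grid algorithm.
    i, j = rng.sample(range(len(C)), 2)   # sampling without replacement
    return C[i], C[j]
```

Seeding the generator (e.g., `random.Random(42)`) fixes the sequence of splits and hence the condensing set for a given run.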

4.3. The RSP3-M and RSP3E-M Algorithms

The RSP3-M and RSP3E-M algorithms are two more simple variations of RSP3 and RSP3E, respectively. Both work as follows: Initially, the algorithms find the two classes with the most instances in the subset. These classes are called the common classes. The mean instances of the common classes constitute the pair of instances based on which a non-homogeneous subset is divided into two subsets (see Figure 2).
Figure 2. The RSP3-M and RSP3E-M algorithms divide a subset according to the means m1 and m2 of the two most common classes in the subset.
Obviously, similar to RSP3-RND and RSP3E-RND, RSP3-M addresses Motive-A, while RSP3E-M addresses both motives. In effect, RSP3-M and RSP3E-M speed up the original algorithm because they replace the computation of the furthest instances of a subset with the much faster computation of the two common classes of the subset and the corresponding mean instances. The idea behind RSP3-M and RSP3E-M is quite simple: by dividing a non-homogeneous subset into two subsets based on the means of its most common classes, the algorithms are more likely to obtain large homogeneous subsets earlier. We expect that RSP3-M and RSP3E-M will maximize the reduction rates. However, this may negatively affect accuracy.
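A possible Python sketch of the RSP3-M pivot selection is given below. The names are ours, and ties between equally common classes are broken by first appearance, a detail the paper does not specify.

```python
from collections import Counter

def common_class_means(C):
    # RSP3-M split pivots: the means of the two most populous classes in C.
    # C is a list of (feature_tuple, class_label) pairs with >= 2 classes.
    counts = Counter(label for _, label in C)
    top_two = [cls for cls, _ in counts.most_common(2)]
    means = []
    for cls in top_two:
        members = [x for x, label in C if label == cls]
        dim = len(members[0])
        means.append(tuple(sum(x[k] for x in members) / len(members)
                           for k in range(dim)))
    return means[0], means[1]
```

Counting class frequencies and averaging class members is linear in |C|, which is the source of the speed-up over the quadratic Grid search.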

4.4. The RSP3-M2 and RSP3E-M2 Algorithms

The RSP3-M2 and RSP3E-M2 algorithms are almost identical to RSP3-M and RSP3E-M, respectively. The only difference is that, instead of using the computed mean instances of the most common classes to divide a non-homogeneous subset, RSP3-M2 and RSP3E-M2 identify and use the training instances that are closest to those means (see Figure 3). We expect that this may lower the reduction rates and, as a result, the accuracy achieved by RSP3-M2 and RSP3E-M2 will be higher than that of RSP3-M and RSP3E-M.
Figure 3. The RSP3-M2 and RSP3E-M2 algorithms divide a subset according to the instances that are closest to the means m1 and m2 of the two most common classes in the subset.
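The RSP3-M2 pivot selection can be sketched as a small refinement on top of the class means. This is illustrative Python; the paper does not prescribe a tie-breaking rule when two instances are equidistant from a mean, so `min` simply keeps the first one encountered.

```python
def nearest_to_mean(C, mean):
    # RSP3-M2 pivots on real training instances: the member of C whose
    # features are closest to the given class mean.
    def sq_dist(x):
        return sum((a - b) ** 2 for a, b in zip(x, mean))
    return min(C, key=lambda inst: sq_dist(inst[0]))
```

Pivoting on actual instances rather than synthetic means keeps each pivot inside its class's data cloud, which is the intuition behind the expected accuracy gain.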

5. Experimental Study

5.1. Experimental Setup

The original RSP3 algorithm and its proposed variations were coded in C++. Moreover, we include the results of the NOP approach (no data reduction) for comparison purposes. The experiments were conducted on a Debian GNU/Linux server equipped with a 12-core CPU with 64 GB of RAM. The experimental results were measured by running the k-NN classifier (with k = 1) over the original training set (case of NOP classifier) and the condensing sets generated by the conventional RSP3 algorithm and its proposed variations. The k parameter value is the only parameter used in the experimental study. Following the common practice in the field of data reduction for instance-based classifiers, we used the setting k = 1.
We used 16 datasets distributed by the KEEL [34] and UCI machine learning [35] repositories, whose main characteristics are summarized in Table 1. Each dataset’s attribute values were normalized to the range [0, 1], and we used the Euclidean distance as a similarity measure. We removed all nominal and fixed-value attributes and the duplicate instances from the KDD dataset, thus reducing its size to 141,481 instances.
Table 1. Datasets characteristics.
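The normalization step described above can be reproduced with a few lines of Python. This is a sketch of standard min-max scaling; mapping constant-valued attributes to 0 is an assumption on our part, since such attributes carry no distance information anyway.

```python
def min_max_normalize(rows):
    # Scale every attribute to [0, 1], as applied to all 16 datasets.
    # rows: list of equal-length numeric tuples.
    cols = list(zip(*rows))
    lo = [min(c) for c in cols]
    hi = [max(c) for c in cols]
    return [tuple((v - l) / (h - l) if h > l else 0.0
                  for v, l, h in zip(row, lo, hi))
            for row in rows]
```

Normalizing before computing Euclidean distances prevents attributes with large numeric ranges from dominating the similarity measure.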
As mentioned above, the major goal of the proposed RSP3 variants is to minimize the computational cost of the condensing set construction, while achieving high reduction rates and keeping accuracy at high levels. Thus, for each algorithm and dataset, we used a five-fold cross-validation schema to measure the following four metrics: (i) Accuracy (ACC), (ii) Reduction Rate (RR), (iii) Distance Computations (DC) required for the condensing set construction (in millions (M)), and (iv) CPU time (CPU) in seconds required for the condensing set construction.
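Of these metrics, the reduction rate is the simplest to state precisely; as a sketch (the function name is our own):

```python
def reduction_rate(train_size, condensing_size):
    # RR (%): share of the training instances that the reduction algorithm
    # removed, i.e., that are not represented by their own prototype.
    return 100.0 * (train_size - condensing_size) / train_size
```

A condensing set of 50 prototypes built from 1000 training instances thus corresponds to an RR of 95%.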

5.2. Experimental Results

Table 2 presents, for each dataset and algorithm, the ACC, RR, DC and CPU measurements. Table 3 summarizes the measurements of Table 2 and presents the average measurements as well as the standard deviation and the coefficient of variation of the measurements.
Table 2. Comparison in terms of ACC (%), RR (%), DC (M) and CPU (secs).
Table 3. Statistics of experimental measurements (Average (AVG), Standard Deviation (STDEV), Coefficient of Variation (CV)).
Furthermore, Figure 4, Figure 5, Figure 6 and Figure 7 present an overview of average measurements in bar diagrams. More specifically, Figure 4 depicts the average accuracy measurements computed by averaging the ACC measurements achieved by the 1-NN classifier using the condensing set generated by the algorithms. Correspondingly, Figure 5 presents the average RR measurements achieved by the algorithms on the different datasets. Figure 6 illustrates the average distance computations and Figure 7 shows the average CPU times. The diagrams presented in Figure 4, Figure 5 and Figure 6 are in linear scale, while the diagram presented in Figure 7 is in logarithmic scale.
Figure 4. Average accuracy measurements.
Figure 5. Average reduction rates measurements.
Figure 6. Average distance computations (in millions).
Figure 7. Average CPU measurements (in secs).
The results reveal that all algorithms are relatively close in terms of accuracy. However, RSP3, RSP3E, RSP3-RND, RSP3E-RND and RSP3E-M2 achieve the highest ACC measurements, whereas the high reduction rates achieved by RSP3-M, RSP3-M2 and RSP3E-M seem to negatively affect accuracy. In almost all cases, RSP3E achieves the highest accuracy, with RSP3E-RND and RSP3-M2 following. The results indicate that the editing mechanism incorporated by these algorithms is effective.
Concerning RR measurements, we observe in Table 2 that RSP3E-M has the highest performance. However, as mentioned above, these high reduction rates negatively affect accuracy. Furthermore, we observe that the algorithms that incorporate the editing mechanism seem to be more effective in terms of RR measurements. In particular, by removing the useless noisy instances from the data, they achieve higher RR measurements than the algorithms that do not incorporate editing and, at the same time, their accuracy is either improved or is not negatively affected.
Moreover, we can observe that the proposed RSP3 variations outperform the original RSP3 in terms of the DC and CPU measurements, which concern the pre-processing cost required for the condensing set construction. This happens because RSP3 computes a large number of distances, whereas the proposed variations divide the subsets while avoiding computationally costly procedures. As far as the large datasets are concerned (i.e., KDD, SH, LIR, MGT), the gains are extremely high, and RSP3 leads to noticeably high CPU costs. Figure 6 and Figure 7 visualize this superiority in terms of pre-processing computational cost.
The experimental results also reveal that RSP3-M and RSP3E-M are faster than RSP3-M2 and RSP3E-M2, respectively. In addition, RSP3-M2 and RSP3E-M2 are faster than RSP3-RND and RSP3E-RND, and the latter are faster than the original RSP3 algorithm and the proposed RSP3E variant.
By observing Table 2 and Table 3 and Figure 4, Figure 5, Figure 6 and Figure 7, we note that the variations with the editing mechanism that removes the subsets containing only one instance (i.e., RSP3E, RSP3E-RND, RSP3E-M and RSP3E-M2) achieve considerably higher RR measurements than the corresponding methods without the editing mechanism, while, in most cases, they also achieve higher accuracy.
Finally, the experimental results show that the RR measurements achieved by RSP3E-M are the highest. In contrast, as expected, RSP3-RND is the algorithm with the lowest reduction rates.

5.3. Statistical Comparisons

5.3.1. Wilcoxon Signed Rank Test Results

Following the common approach that is applied in the field of PS and PG algorithms [3,4,10,14,24,25,27], the experimental study is complemented with a Wilcoxon signed rank test [36]. Thus, we statistically confirm the validity of the measurements presented in Table 2. The Wilcoxon signed rank test compares all the algorithms in pairs, considering the result achieved against each dataset. We applied the Wilcoxon signed rank test using the PSPP statistical software.
As mentioned above, it is clear that RSP3-M and RSP3E-M compute fewer distances than RSP3-M2 and RSP3E-M2, respectively. Furthermore, RSP3-M2 and RSP3E-M2 compute fewer distances than RSP3-RND and RSP3E-RND, and the latter compute fewer distances than RSP3 and RSP3E. Thus, we do not run the Wilcoxon test for the DC measurements.
Table 4 presents the results of the Wilcoxon signed rank test obtained for the ACC, RR and CPU measurements. The column labeled “w/l/t” lists the number of wins, losses and ties for each comparison test. The column labeled “Wilcoxon” (last column) lists a value that quantifies the significance of the difference between the two algorithms compared. When this value is lower than 0.05, one can claim that the difference is statistically significant.
Table 4. Results of Wilcoxon signed rank test on ACC, RR and CPU measurements.
In terms of accuracy, the results show that the statistical difference between the following pairs is not significant: NOP versus RSP3E, NOP versus RSP3E-RND and NOP versus RSP3E-M2. In contrast, the statistical difference between the conventional RSP3 algorithm and NOP is significant. Thus, we can claim that the 1-NN classifier that runs over the condensing set generated by the proposed RSP3E, RSP3E-RND and RSP3E-M2 algorithms achieves as high accuracy as the 1-NN classifier that runs over the original training set. Moreover, the test shows that there is no significant difference in terms of accuracy between the original version of RSP3 and the following proposed variants: RSP3E, RSP3-RND, and RSP3E-RND.
In contrast, there is a statistically significant difference in terms of reduction rates and CPU times. This means that we can obtain accuracy as high as that of the original RSP3 algorithm while the cost of the condensing set construction is lower. Moreover, the test confirms that there is a statistically significant difference in terms of accuracy between the pairs RSP3-M versus RSP3-M2 and RSP3E-M versus RSP3E-M2. Although there is also a significant difference in terms of RR and CPU measurements, RSP3-M2 and RSP3E-M2 can be considered better. Last but not least, the test shows that RSP3E, RSP3E-RND, RSP3E-M and RSP3E-M2 dominate RSP3, RSP3-RND, RSP3-M and RSP3-M2, respectively, in terms of reduction rates, while the accuracy and the CPU times are not negatively affected.

5.3.2. Friedman Test Results

The non-parametric Friedman test was used in order to rank the algorithms. The test ranks the algorithms for each dataset separately: the best performing algorithm is ranked number 1, the second best is ranked number 2, and so on. We ran the Friedman test through the PSPP statistical software, once for each measured criterion. Table 5 presents the results of the Friedman test obtained for the ACC, RR and CPU measurements.
Table 5. Results of Friedman test on ACC, RR and CPU measurements.
The Friedman test shows that:
  • RSP3E is the most accurate approach. RSP3E-RND, RSP3, RSP3-RND and RSP3E-M2 are the runners-up.
  • RSP3E-M and RSP3E-M2 achieve the highest RR measurements. RSP3-M and RSP3E are the runners-up.
  • RSP3E-M and RSP3-M are the fastest approaches. RSP3-M2 and RSP3E-M2 are the runners-up.

6. Conclusions

This paper proposed three RSP3 variations that aim at reducing the computational cost incurred by the original RSP3 algorithm. All the proposed variations replace the costly task of finding the pair of the furthest instances in a subset with a faster procedure. The first one (RSP3-RND) selects two random instances. The second one (RSP3-M) computes and uses the means of the two most common classes in a subset. The last variation (RSP3-M2) uses the instances that are closest to the means of the two most common classes in a subset.
Moreover, the present paper proposed an editing mechanism for noise removal. The latter does not generate a prototype for each homogeneous subset that contains only one training instance. In effect, this instance is considered noise and is removed. The editing mechanism can be incorporated into any RSP3 algorithm (original RSP3 included). Therefore, in this paper, we developed and tested seven new versions of the original RSP3 PG algorithm (i.e., RSP3E, RSP3-RND, RSP3E-RND, RSP3-M, RSP3E-M, RSP3-M2, RSP3E-M2).
The experimental study as well as the Wilcoxon and Friedman tests revealed that the editing mechanism is quite effective, since it removes a high number of irrelevant training instances that do not contribute to classification accuracy. Thus, the reduction rates are improved either with gains or, at least, without loss in accuracy. In addition, the results showed that RSP3-M2 is more effective than RSP3-M. Although the RSP3-RND variation is simple, it is quite accurate; this is because the reduction rate achieved by RSP3-RND is not very high.
In our future work, we plan to develop data reduction techniques for complex data, such as multi-label data, data in non-metric spaces and data streams.

Author Contributions

Conceptualization, S.O., T.M., G.E. and D.M.; methodology, S.O., T.M., G.E. and D.M.; software, S.O., T.M., G.E. and D.M.; validation, S.O., T.M., G.E. and D.M.; formal analysis, S.O., T.M., G.E. and D.M.; investigation, S.O., T.M., G.E. and D.M.; resources, S.O., T.M., G.E. and D.M.; data curation, S.O., T.M., G.E. and D.M.; writing—original draft preparation, S.O., T.M., G.E. and D.M.; writing—review and editing, S.O., T.M., G.E. and D.M.; visualization, S.O., T.M., G.E. and D.M.; supervision, S.O., T.M., G.E. and D.M.; project administration, S.O., T.M., G.E. and D.M. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

Publicly available datasets were analyzed in this study. These data can be found here: https://archive.ics.uci.edu/ml/ and https://sci2s.ugr.es/keel/datasets.php (accessed on 2 October 2022).

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
DRT: Data Reduction Technique
PG: Prototype Generation
PS: Prototype Selection
RSP3: Reduction by Space Partitioning 3
RSP3E: Reduction by Space Partitioning 3 with Editing
RSP3-RND: Reduction by Space Partitioning 3 with Random pairs
RSP3E-RND: Reduction by Space Partitioning 3 with Editing and Random pairs
RSP3-M: Reduction by Space Partitioning 3 using Means
RSP3E-M: Reduction by Space Partitioning 3 with Editing using Means

References

  1. García, S.; Luengo, J.; Herrera, F. Data Preprocessing in Data Mining; Intelligent Systems Reference Library; Springer: Berlin/Heidelberg, Germany, 2015.
  2. Cover, T.; Hart, P. Nearest Neighbor Pattern Classification. IEEE Trans. Inf. Theor. 1967, 13, 21–27.
  3. Garcia, S.; Derrac, J.; Cano, J.; Herrera, F. Prototype Selection for Nearest Neighbor Classification: Taxonomy and Empirical Study. IEEE Trans. Pattern Anal. Mach. Intell. 2012, 34, 417–435.
  4. Triguero, I.; Derrac, J.; Garcia, S.; Herrera, F. A Taxonomy and Experimental Study on Prototype Generation for Nearest Neighbor Classification. Trans. Syst. Man Cyber Part C 2012, 42, 86–100.
  5. Sánchez, J.S. High training set size reduction by space partitioning and prototype abstraction. Pattern Recognit. 2004, 37, 1561–1564.
  6. Ougiaroglou, S.; Evangelidis, G. Dealing with Noisy Data in the Context of k-NN Classification. In Proceedings of the 7th Balkan Conference on Informatics Conference, Craiova, Romania, 2–4 September 2015; ACM: New York, NY, USA, 2015; pp. 28:1–28:4.
  7. Giorginis, T.; Ougiaroglou, S.; Evangelidis, G.; Dervos, D.A. Fast data reduction by space partitioning via convex hull and MBR computation. Pattern Recognit. 2022, 126, 108553.
  8. Jin, X.; Han, J. K-Means Clustering. In Encyclopedia of Machine Learning; Sammut, C., Webb, G.I., Eds.; Springer: Boston, MA, USA, 2010; pp. 563–564.
  9. Wu, J. Advances in K-means Clustering: A Data Mining Thinking; Springer Publishing Company, Incorporated: New York, NY, USA, 2012.
  10. Ougiaroglou, S.; Evangelidis, G. RHC: Non-Parametric Cluster-Based Data Reduction for Efficient k-NN Classification. Pattern Anal. Appl. 2016, 19, 93–109.
  11. Castellanos, F.J.; Valero-Mas, J.J.; Calvo-Zaragoza, J. Prototype generation in the string space via approximate median for data reduction in nearest neighbor classification. Soft Comput. 2021, 25, 15403–15415.
  12. Valero-Mas, J.J.; Castellanos, F.J. Data Reduction in the String Space for Efficient kNN Classification Through Space Partitioning. Appl. Sci. 2020, 10, 3356.
  13. Ougiaroglou, S.; Evangelidis, G. Efficient editing and data abstraction by finding homogeneous clusters. Ann. Math. Artif. Intell. 2015, 76, 327–349.
  14. Gallego, A.J.; Calvo-Zaragoza, J.; Valero-Mas, J.J.; Rico-Juan, J.R. Clustering-Based k-Nearest Neighbor Classification for Large-Scale Data with Neural Codes Representation. Pattern Recogn. 2018, 74, 531–543.
  15. Ougiaroglou, S.; Evangelidis, G. Efficient k-NN classification based on homogeneous clusters. Artif. Intell. Rev. 2013, 42, 491–513.
  16. Ougiaroglou, S.; Evangelidis, G.; Dervos, D.A. FHC: An adaptive fast hybrid method for k-NN classification. Log. J. IGPL 2015, 23, 431–450.
  17. Gallego, A.J.; Rico-Juan, J.R.; Valero-Mas, J.J. Efficient k-nearest neighbor search based on clustering and adaptive k values. Pattern Recognit. 2022, 122, 108356.
  18. Impedovo, S.; Mangini, F.; Barbuzzi, D. A Novel Prototype Generation Technique for Handwriting Digit Recognition. Pattern Recogn. 2014, 47, 1002–1010.
  19. Carpenter, G.A.; Grossberg, S. Adaptive Resonance Theory (ART). In The Handbook of Brain Theory and Neural Networks; MIT Press: Cambridge, MA, USA, 1998; pp. 79–82.
  20. Rezaei, M.; Nezamabadi-pour, H. Using gravitational search algorithm in prototype generation for nearest neighbor classification. Neurocomputing 2015, 157, 256–263.
  21. Rashedi, E.; Nezamabadi-pour, H.; Saryazdi, S. GSA: A Gravitational Search Algorithm. Inf. Sci. 2009, 179, 2232–2248.
  22. Hu, W.; Tan, Y. Prototype Generation Using Multiobjective Particle Swarm Optimization for Nearest Neighbor Classification. IEEE Trans. Cybern. 2016, 46, 2719–2731.
  23. Elkano, M.; Galar, M.; Sanz, J.; Bustince, H. CHI-PG: A fast prototype generation algorithm for Big Data classification problems. Neurocomputing 2018, 287, 22–33.
  24. Escalante, H.J.; Graff, M.; Morales-Reyes, A. PGGP: Prototype Generation via Genetic Programming. Appl. Soft Comput. 2016, 40, 569–580.
  25. Calvo-Zaragoza, J.; Valero-Mas, J.J.; Rico-Juan, J.R. Prototype Generation on Structural Data Using Dissimilarity Space Representation. Neural Comput. Appl. 2017, 28, 2415–2424.
  26. Cruz-Vega, I.; Escalante, H.J. An Online and Incremental GRLVQ Algorithm for Prototype Generation Based on Granular Computing. Soft Comput. 2017, 21, 3931–3944. [Google Scholar] [CrossRef]
  27. Escalante, H.J.; Marin-Castro, M.; Morales-Reyes, A.; Graff, M.; Rosales-Pérez, A.; Montes-Y-Gómez, M.; Reyes, C.A.; Gonzalez, J.A. MOPG: A Multi-Objective Evolutionary Algorithm for Prototype Generation. Pattern Anal. Appl. 2017, 20, 33–47. [Google Scholar] [CrossRef]
  28. Jain, B.J.; Schultz, D. Asymmetric learning vector quantization for efficient nearest neighbor classification in dynamic time warping spaces. Pattern Recognit. 2018, 76, 349–366. [Google Scholar] [CrossRef]
  29. Silva, L.A.; de Vasconcelos, B.P.; Del-Moral-Hernandez, E. A Model to Estimate the Self-Organizing Maps Grid Dimension for Prototype Generation. Intell. Data Anal. 2021, 25, 321–338. [Google Scholar] [CrossRef]
  30. Sucholutsky, I.; Schonlau, M. Optimal 1-NN prototypes for pathological geometries. PeerJ Comput. Sci. 2021, 7, e464. [Google Scholar] [CrossRef]
  31. Chen, C.H.; Jóźwik, A. A sample set condensation algorithm for the class sensitive artificial neural network. Pattern Recogn. Lett. 1996, 17, 819–823. [Google Scholar] [CrossRef]
  32. Alcala-Fdez, J.; Sanchez, L.; Garcia, S.; del Jesus, M.J.; Ventura, S.; Guiu, J.M.G.; Otero, J.; Romero, C.; Bacardit, J.; Rivas, V.M.; et al. KEEL: A software tool to assess evolutionary algorithms for data mining problems. Soft Comput. 2008, 13, 307–318. [Google Scholar] [CrossRef]
  33. Ougiaroglou, S.; Evangelidis, G. EHC: Non-parametric Editing by Finding Homogeneous Clusters. In Foundations of Information and Knowledge Systems; Beierle, C., Meghini, C., Eds.; Lecture Notes in Computer Science; Springer International Publishing: Berlin/Heidelberg, Germany, 2014; Volume 8367, pp. 290–304. [Google Scholar] [CrossRef]
  34. Alcalá-Fdez, J.; Fernández, A.; Luengo, J.; Derrac, J.; García, S. KEEL Data-Mining Software Tool: Data Set Repository, Integration of Algorithms and Experimental Analysis Framework. J. Multiple Valued Log. Soft Comput. 2011, 17, 255–287. [Google Scholar]
  35. Dua, D.; Graff, C. UCI Machine Learning Repository; University of California, School of Information and Computer Science: Irvine, CA, USA, 2019; Available online: http://archive.ics.uci.edu/ml (accessed on 1 October 2022).
  36. Sheskin, D. Handbook of Parametric and Nonparametric Statistical Procedures; A Chapman & Hall book; Chapman & Hall/CRC: Boca Raton, FL, USA, 2011. [Google Scholar]
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.