This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution license (http://creativecommons.org/licenses/by/3.0/).
Feature selection, also known as attribute selection, is the technique of selecting a subset of relevant features for building robust object models. It is becoming more and more important for large-scale sensor applications with AI capabilities. The core idea of this paper derives from a straightforward and intuitive principle: if a feature subset (pattern) is more representative, it should be more self-organized and, as a result, more insensitive to artificially seeded noise points. In light of this heuristic finding, we established a set of theoretical principles, based on which we propose a two-stage framework, called seeding and harvest (S&H for short), to evaluate the relative importance of feature subsets. At the first stage, we inject a number of artificial noise points into the original dataset; at the second stage, we resort to an outlier detector to identify them under various feature patterns. The more precisely the seeded points can be extracted under a particular feature pattern, the more valuable and important the corresponding feature pattern should be. We also compared our method with several state-of-the-art feature selection methods on a number of real-life datasets. The experimental results confirm that our method accomplishes feature reduction tasks with high accuracy as well as low computational complexity.
There are more and more sensor applications requiring artificial intelligence (AI), machine learning and data mining technologies to identify new, potential and useful knowledge from datasets [
As illustrated in
Instance-based data reduction methods, like various sampling techniques, have been studied thoroughly [
Refer to the third column of
According to the kind of evaluator adopted, a feature selection methodology can be further categorized as a wrapper or a filter; the two are distinguished by whether a specific AI algorithm is required as the measure of the relative importance of different feature subsets (the last column of
From another perspective, namely whether the label (class) information is considered, feature reduction methodologies can also be classified into supervised and unsupervised ones. Label information may be difficult to access in many applications, and more and more datasets are given without it. Hence, in this paper we concentrate on unsupervised methods. Because supervised methods take the auxiliary label information into consideration, they are probably more suitable for classification tasks, while unsupervised methods tend to be more suitable for clustering tasks [
Generally speaking, in this paper we propose a flexible framework called S&H, which is capable of ordering feature subsets according to their relative importance (a sorter). To cooperate with the sorter, we improved the traditional heuristic search methodologies into order-based ones, which we call ordinal searchers. These two components, the sorter and the ordinal searcher, compose our main structure for handling the feature selection problem, which is distinct from the traditional “evaluator and searcher” structure in that we concentrate on “orders” rather than “values”. This property makes our structure more sensible and straightforward, because the underlying purpose of feature selection is simply to find the best feature pattern, not to answer quantitatively how superior that pattern is.
As stated above, our S&H sorter framework was initially inspired by a simple intuitive principle: if a feature subset is more representative, it should be more self-organized and, as a result, more insensitive to artificially injected noise points. Accordingly, our S&H sorter can be divided into two main stages, called “seeding” and “harvest”. At the seeding stage, we inject some artificial noise points into the dataset; at the harvest stage, we resort to a uniformly partitioning-based outlier detector [
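For implementation-minded readers, the two stages can be sketched in a few lines of Python. This is only a minimal illustration under our own assumptions: the names (`seed_points`, `pattern_merit`, `knn_outlier_scores`) are ours, and a plain k-NN distance detector stands in for the uniformly partitioning-based detector actually used in this paper.

```python
import random

def seed_points(data, n_seeds, rng):
    """Seeding stage: inject noise points drawn uniformly from the
    bounding box of the original data."""
    dims = len(data[0])
    lo = [min(p[j] for p in data) for j in range(dims)]
    hi = [max(p[j] for p in data) for j in range(dims)]
    return [tuple(rng.uniform(lo[j], hi[j]) for j in range(dims))
            for _ in range(n_seeds)]

def knn_outlier_scores(points, k=3):
    """A simple k-NN distance outlier score (an illustrative stand-in
    for the paper's uniformly partitioning-based detector)."""
    scores = []
    for p in points:
        dists = sorted(sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5
                       for q in points if q is not p)
        scores.append(sum(dists[:k]) / k)
    return scores

def pattern_merit(data, seeds):
    """Harvest stage: the fraction of seeded points ranked among the top
    outliers.  A higher value suggests a more self-organized pattern."""
    merged = list(data) + list(seeds)
    scores = knn_outlier_scores(merged)
    n = len(seeds)
    top = sorted(range(len(merged)), key=lambda i: scores[i], reverse=True)[:n]
    return sum(1 for i in top if i >= len(data)) / n
```

Projecting the dataset onto each candidate feature subset and comparing the resulting `pattern_merit` values then yields the relative importance of the subsets.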
Although derived from an intuitive principle, our methodology is based on solid theoretical foundations. The key points are listed as follows:
We modeled the feature-selected clustering problem as a rigorous mathematical optimization problem.
We proposed the concept of coverability, which we proved to be an intrinsic property of a given dataset.
We showed that solving the feature selection problem is equivalent to finding the specific feature pattern under which the dataset exhibits the smallest coverability.
We established the correlation between coverability and the probability with which the seeded points can be detected correctly.
We eventually concluded that solving the feature selection problem is equivalent to finding the specific feature subset under which the seeded points can be extracted most exactly.
This paper is organized as follows: In Section 2, we review some related work. In Section 3, we present our main principles involved. The practical interpretation of the theories is given in Section 4, with some important considerations in practice. In Section 5, we describe the implementation of our methodology in detail, and provide the main algorithms in pseudocode. The comparison experiments on extensive datasets are analyzed in Section 6; and finally, our conclusions are presented in Section 7.
This section briefly reviews the state-of-the-art feature selection algorithms, which can be categorized according to a number of criteria as we have illustrated in
A rather simple attribute ranking method is the information gain [
Relief [
CFS [
Consistency-based methods [
All the above are supervised feature selection methods. Compared with them, the unsupervised methods do not need class labels. Next, we will review some unsupervised methods.
A common category of unsupervised feature selection methodology is the one based on various clustering technologies. For example, Dy and Brodley proposed a cluster-based method [
There also exist other kinds of unsupervised methods. As we know, some transformation-based methods like PCA and FA are statistical unsupervised methods, which have been discussed in Section 1. Besides them, a spectrum-based method [
Generally speaking, the most significant difference between this work and other unsupervised methods is that we are the first to resort to outlier detection technologies to study feature selection problems. This is achieved by means of our fundamental theories, which are covered in the next section.
Before introducing our theories, we believe that we should demonstrate the importance of feature selection through a simple but concrete example.
Let us consider the simple clustering problem illustrated in
Thus, we can conjecture that the most valuable information resides in the horizontal dimension. To clarify this point, we try to cluster this dataset using the standard 2-means method [
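To make this kind of example reproducible, a minimal 2-means (Lloyd's algorithm) sketch in Python is given below; `two_means` is our own illustrative name, and the synthetic data in the usage example merely mimics the described situation of two clusters differing only in their horizontal means.

```python
import random

def two_means(points, iters=20, rng=None):
    """A minimal Lloyd's 2-means on tuples of coordinates.
    Returns the two centers and the final point groups."""
    rng = rng or random.Random(0)
    centers = rng.sample(points, 2)          # pick two data points as seeds
    for _ in range(iters):
        groups = ([], [])
        for p in points:                      # assign each point to its
            d = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers]
            groups[d.index(min(d))].append(p)  # nearest center
        centers = [tuple(sum(v) / len(g) for v in zip(*g)) if g else c
                   for g, c in zip(groups, centers)]  # recompute centers
    return centers, groups
```

Running `two_means` on the horizontal coordinate alone recovers the two groups cleanly, whereas the noisy vertical dimension contributes nothing but interference, which is exactly the effect of feature selection illustrated here.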
Because of the intuitive and heuristic nature of our methodology, it is much more straightforward to explain it through visual examples rather than pure theory. Thus, in the following, we first present the core ideas of our methodology through the analysis of a simple synthetic multidimensional dataset.
Let us inspect the synthetic dataset shown in
This figure gives the linked two-dimensional scatter plots of our synthetic multidimensional dataset consisting of 4 independent attributes labeled
Now, let us inspect the fundamental problem of ordering these three attribute subsets ({
If we denote the merit of an attribute subset
Next, we consider what will happen if we inject some artificial noise points into the dataset.
First, let us inspect the plot of attribute subset {
Similarly, let us inspect
Finally, we inspect
As can be seen, the above 3 subplots (
In practice, if seeded points are more significant, they are more likely to be identified among the original points. That is to say, we can evaluate the relative importance of different attribute subsets in terms of how precisely the seeded points can be detected under these attribute subsets. This is indeed what Theorem 6 (of Section 3.5) will tell us. Hence, through this example, we have previewed Theorem 6 from a practical point of view.
With the above intuitions, as a starting point of the theoretical analysis, we will present the modeling of standard clustering problems in the next section.
We consider a dataset
Now, let us consider the standard clustering problem. If we denote the set of all possible clustering patterns as
(a, a) ∈
Essentially speaking, the relation
Furthermore, based on the properties enumerated in Definition 1, we can define the best clustering pattern set (BCPS) as follows:
There is an interesting result under the above definition.
∀x, y ∈ B, where B is the BCPS under Definition 2, we have F(D, x) = F(D, y).
Here, we will prove it by contradiction. First, we assume that
Generally speaking, every clustering methodology has its own distinct CEF
Together with Definition 3, Theorem 1 clarifies a simple truth: all the clustering patterns in the BCPS have an equally maximized CEF value, which can be found by solving the maximization problem expressed in
To make the above theories more concrete, the standard
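Since the original equations are not reproduced here, the following standard formulation is offered as a sketch of the k-means CEF: it is the negative within-cluster sum of squares (WCSS), so maximizing this CEF amounts to minimizing WCSS.

```latex
F_{k\text{-}\mathrm{means}}(D, c) \;=\; -\,\mathrm{WCSS}(D, c)
  \;=\; -\sum_{i=1}^{k} \sum_{x \in c_i} \lVert x - \mu_i \rVert^{2},
\qquad
\mu_i \;=\; \frac{1}{|c_i|} \sum_{x \in c_i} x .
```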
According to
Because
If (a, b) ∈
Because either
Thus, we know
Theorem 2 tells us that,
In this subsection, we will investigate a special kind of CEF, called the feature-additive CEF.
Feature-additive CEF: If a CEF F can be expressed as:
Hence, by substituting
Again, we resort to
From
The introduction of feature-additive clustering is valuable in the sense that the feature selection problem can then be elegantly expressed as an optimization problem.
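In our notation (a sketch, since the original equations are not shown here: $w$ is a 0/1 feature-indicator vector, $D^{(j)}$ the $j$-th attribute column, and $m$ the target subset size), a feature-additive CEF and the resulting feature-selected clustering problem read:

```latex
F(D, c) \;=\; \sum_{j=1}^{d} f_j\!\left(D^{(j)}, c\right),
\qquad
\max_{\substack{w \in \{0,1\}^{d} \\ \lVert w \rVert_0 = m}} \;
\max_{c \in \mathcal{C}} \;
\sum_{j=1}^{d} w_j \, f_j\!\left(D^{(j)}, c\right).
```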
In
One may wonder how the optimization problem in
As discussed previously,
As we know, a clustering pattern can be expressed as a vector of point sets, denoted as c = (
Now, let us inspect cluster
If we treat
As we know,
With the above definitions, we can now give the rigorous definition of coverability.
The following theorem can help us to interpret the essence of coverability more deeply.
Because
Because the infimum of WCSS for a specific dataset is definite, Theorem 4 essentially tells us that coverability is an intrinsic property of a dataset, independent of any concrete clustering method. Reviewing Theorem 4, one may ask: isn't WCSS good enough, and why did we bother to introduce the concept of coverability? Roughly speaking, Theorem 4 presents just one perspective for interpreting the concept of coverability. Its essence can only be exposed from another point of view, where coverability is interpreted as the ability of a dataset to cover seeded points and make them difficult to identify. We will explain this in detail below.
What are seeded points? Look at
To determine the quantity of seeds, we denote the number of seeded points as
Now, let us try to interpret the term—
Next, let us consider the probability
From the above, we can summarize and make our fundamental hypothesis as follows.
As we have pointed out, coverability is an intrinsic property for a dataset, hence Hypothesis 1 essentially tells us that
Essentially speaking, the requirement that Hypothesis 1 imposes on an outlier detector is that the correct detection probability should be negatively correlated with the space covered by the original points. This requirement is so loose that Hypothesis 1 seems to be a characteristic of outlier detectors in general. In this paper, whenever we talk about an outlier detector, we exclusively refer to an ideal outlier detector, for which Hypothesis 1 holds. In practice, the validity of Hypothesis 1 can be verified phenomenologically by experiments or mechanistically by theory. Through plenty of experiments and theoretical investigations, we have found that most existing outlier detectors can be treated as ideal outlier detectors to some extent. This again confirms that Definition 8 reveals a general property of outlier detectors. We give a detailed description of the uniformly partitioning-based outlier detector in Section 4.1, and in Section 4.2 we prove that it conforms to Hypothesis 1.
From now on, we will take the feature selection effect into consideration, which is indicated by the vector
For cluster
Analogously to Definition 6, we can define
Thus, similar to Definition 7, the coverability for a feature-selected dataset can be defined as
With the above discussions, we can define the optimal feature pattern as follows.
Again, we would like to explain Definition 9 in a concrete manner by investigating
From
By comparing
Essentially speaking, Theorem 5 reveals an important fact: the feature selection task for
To make the above discussions rigorous, we first give a corollary of Hypothesis 1.
Corollary 1 is straightforward. If we treat the feature-selected database as a new database, then in this new database,
Because of
Theorem 6 tells us that we can accomplish feature selection tasks by finding the particular feature pattern under which the seeded points can be extracted with the highest probability. This methodology is simpler and more feasible than solving the optimization problem in
There are still some remaining problems, which need to be discussed in detail.
In this section, we mainly explain two important components of our framework in detail, namely the harvester and the searcher. Let us begin with our uniformly partitioning-based harvester.
As stated above, if we treat the seeded points as outliers among the original data points, the harvest procedure is essentially an outlier detection process. There are a lot of state-of-the-art methods that can be employed. In this paper, a recent uniformly partitioning-based method called ordinal isolation [
It is simple and fast, with
It is scalable, because it arranges its main computations in a tree, whose branches can be pruned out during the proceeding of the whole algorithm.
More details for this algorithm can be found in the literature [
In this paper, although we adapted the ordinal isolation algorithm to better suit our harvest tasks, we do not repeat its main principles here, which can be found thoroughly in the literature. Instead, we present the detailed processing procedure of the harvester in a more practical way: we consider the simple example given in Section 3.1 again and show the harvester's detailed processing procedure for this simple problem.
We denote the operation of counting the number of isolated seeded points (dark crosses) as S(
If we define
Analogously, from
Finally, from
Thus, from
Note that the order given in
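The counting operation S(·) can be mimicked with a uniform grid, as in the following Python sketch. It is a simplified stand-in for ordinal isolation (the function name and the unit-cube assumption are ours): at partitioning level ℓ the space is cut into 2^ℓ cells per dimension, and a seed counts as isolated when it occupies a cell alone.

```python
def isolated_seed_count(original, seeds, level):
    """Count the seeded points that sit alone in a cell of a uniform grid
    with 2**level cells per dimension (coordinates assumed in [0, 1]).
    A simplified stand-in for the S(.) counting of ordinal isolation."""
    k = 2 ** level
    cells = {}
    for tag, pts in (("orig", original), ("seed", seeds)):
        for p in pts:
            cell = tuple(min(int(v * k), k - 1) for v in p)
            cells.setdefault(cell, []).append(tag)
    # a seed is "isolated" when its cell contains nothing but that one seed
    return sum(1 for occupants in cells.values() if occupants == ["seed"])
```

Ordering attribute subsets by this count, largest first, reproduces the kind of significance order discussed in this example: the less space the original points cover under a subset, the more seeds end up isolated.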
As Definition 8 reveals, the uniformly partitioning-based outlier detector can be classified as an ideal outlier detector if and only if ∀
First, let us assume a situation illustrated in
In this situation, we only consider the seeded points, which are uniformly distributed in the data space. We carry out a recursive and uniform partitioning procedure. When we reach the 32 × 32 partitioning stage, we notice from
Then, we consider what will happen when the original data points are populated into this data space. We illustrate this situation in
Second, when we consider the original data points as a whole, we can see that in the middle of
Now, let us consider how the original points act on the correct detection ratio.
First, we consider the position of the original points as a whole; that is, we consider the effect of a common translation of all the original points. In this situation, because the seeded points are distributed uniformly, the interference is also spread uniformly across the data space. That is to say, translating the original data points cannot significantly alter the correct detection ratio.
Second, we consider how the size of the original data points affects the correct detection ratio when the concentration is held at a fixed level. As
Last, we consider how the concentration of the original data points affects the correct detection ratio when their size is held fixed. In this situation, it is straightforward to see that, with the ratio of affected points fixed, increasing the concentration makes it more likely that original points are isolated, which results in the detection of original points rather than seeded points and thus reduces the correct detection ratio. So, as a whole, the correct detection ratio is negatively correlated with the concentration of the original data points when their size is fixed.
By now, we are equipped to investigate how the coverability of the original points is correlated with their size and concentration. As we have discussed, the coverability of a dataset depicts its space-covering ability, and, as we proved in Theorem 4, it is equal to the infimum of WCSS. We can conclude that the coverability of the original points is positively correlated with their concentration and size.
Generally speaking, from the above discussions, we can conclude that the coverability of the original points is negatively correlated with the probability (ratio) of correct detection. That is to say, the uniformly partitioning-based outlier detector we adopted is indeed a particular type of ideal outlier detector.
In the next subsection, we will address why the “order” is superior to the “value” and explain the main principles of ordinal search methodologies.
Most traditional heuristic search methodologies are value-based, where the search directions are determined according to the merit values of attribute subsets. The cooperation pattern between heuristic searchers and attribute subset evaluators is illustrated in
From
The above questions are straightforward to answer. Let us take an example. If Tom is 1.75
“Order” is much more robust against noise and easier to obtain than “Value”.
Do not insist on getting the “Best” but be willing to settle for the “Good Enough”.
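The robustness of “order” over “value” can be demonstrated in a few lines: any monotone distortion of the merit values (the concrete subsets and scores below are made up purely for illustration) changes every value but leaves the ranking, and hence every order-based decision, intact.

```python
import math

# Hypothetical merit values for four attribute subsets (illustrative only).
merits = {"{a}": 0.40, "{a,b}": 0.72, "{a,c}": 0.55, "{a,b,c}": 0.71}

def rank(m):
    """Order-based view: keep only the ranking of subsets, not the scores."""
    return sorted(m, key=m.get, reverse=True)

# A monotone rescaling (e.g., a different measurement scale or a nonlinear
# detector response) changes every value ...
rescaled = {s: math.sqrt(v) * 100 for s, v in merits.items()}

# ... but leaves the order, and every order-based decision, untouched.
assert rank(merits) == rank(rescaled)
```

A value-based searcher, in contrast, may change its behavior whenever the merit scale changes, even though nothing about the underlying preference between subsets has changed.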
So, in this paper, we improve the traditional value-based search methods into order-based ones. Accordingly, the value-based pattern in
The last question is: how can we get the order of attribute subsets by means of our seeding and harvest framework? Appealing to
From the previous discussions, we see that our seeding and harvest framework is capable of sorting the input attribute subsets in terms of their relative importance. This order is used by the order-based searcher to determine the direction of the next search step. The main structure of their cooperation is illustrated in
In AI, heuristic search is a metaheuristic method for solving computationally hard optimization problems. Heuristic search can be used on problems that can be formulated as finding a solution maximizing a criterion among a number of candidate solutions. Heuristic search algorithms move from solution to solution in the space of candidate solutions (the search space) by applying local changes, until a solution deemed optimal is found or a time bound has elapsed [
There are a lot of state-of-the-art heuristic search algorithms that can be adopted in feature selection applications. In this subsection, we show how the simple greedy hill climbing search algorithm can be transformed into a corresponding order-based one.
First, Algorithm 1 gives the traditional value-based greedy hill climbing search method.
1:
In this algorithm, we evaluate all the possible directions for the next step and pick the direction with the highest merit gain. It is obviously value-based, because it depends on merit values and their comparisons.
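A compact Python rendering of this value-based scheme may help; the function and the `merit` callback are our illustrative stand-ins for Algorithm 1, which is not reproduced verbatim here.

```python
def greedy_hill_climb(all_attrs, merit):
    """Value-based greedy forward selection: repeatedly add the attribute
    whose inclusion yields the highest merit gain.  `merit` maps a
    frozenset of attributes to a numeric score."""
    current = frozenset()
    best = merit(current)
    while True:
        candidates = [current | {a} for a in all_attrs - current]
        if not candidates:
            return current
        nxt = max(candidates, key=merit)   # value comparison drives the search
        if merit(nxt) <= best:             # no direction improves the merit
            return current
        current, best = nxt, merit(nxt)
```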
Then, we transform Algorithm 1 into an order-based search algorithm, which is elaborated in Algorithm 2.
1:
In this algorithm, the
The purpose of Algorithm 2 is self-explanatory. Note that a state in Algorithm 2 is virtually an attribute subset. Essentially, line 4 of Algorithm 2 uses a so-called attribute subset sorter to order the sequence comprising the current state and all its possible child states into an ordered sequence of attribute subsets. The head of this sequence is then treated as the next state, which is expected to present the highest merit gain in practice. This procedure is applied iteratively until the current state cannot be improved further; the corresponding attribute subset is then the result of an ordinal feature selection task.
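For comparison with the value-based version, the order-based variant can be sketched as follows (again our own illustrative rendering, not Algorithm 2 verbatim); the `sorter` callback plays the role of the attribute subset sorter in line 4 of Algorithm 2 and never exposes any merit values to the searcher.

```python
def ordinal_hill_climb(all_attrs, sorter):
    """Order-based greedy hill climbing.  `sorter` takes a list of
    attribute subsets and returns them ordered by relative importance,
    best first (e.g., the S&H subset sorter)."""
    current = frozenset()
    while True:
        children = [current | {a} for a in all_attrs - current]
        head = sorter([current] + children)[0]  # only the ORDER is used
        if head == current:                     # current state cannot improve
            return current
        current = head
```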
As we know, there are plenty of heuristic search algorithms, such as best-first search and genetic search. They can be transformed into order-based ones analogously. In this paper, we adopt the method shown in Algorithm 2 as our ordinal searcher (
In this subsection, we elaborate how to sort a sequence of attribute subsets by means of our seeding and harvest framework. As discussed previously, there are three main components in our algorithm: the seeding component, the harvest component, and the searcher component.
In
The searcher component has been studied thoroughly in Algorithm 2. The harvest component is virtually an implementation of the
Algorithm 3 elaborates the detailed implementation of the harvest component. Meanwhile, to make Algorithm 3 easier to follow, we draw a rather “big” graphical guide illustrating its main structure in
Algorithm 3 is implemented in a “level by level” manner as illustrated in
One question may remain: why bother to output a whole ordered list rather than just the best attribute subset? For greedy hill climbing search, the best subset alone would indeed suffice, because the ordered list is eventually used only to find it. However, for more sophisticated search methodologies that demand more information than just the best attribute subset to decide the search direction, the full order is necessary. This reasoning motivated us to implement the harvest algorithm in the manner of Algorithm 3, to attain more flexibility.
In the next subsection, we will analyze the complexity of our method.
From Algorithm 3 we see that the whole process can stop when all the cells in
In
Now, let us talk about the number of attribute subsets. Here we denote the dimension of the original dataset as
In each level of Algorithm 3, a total scan of the original dataset accomplishes the partitioning mission, whose complexity is
In this section, we will carry out a series of experiments on plenty of real-life datasets from the UCI Machine Learning Repository [
All experiments were conducted on an Intel Core 2 PC with two 1.80 GHz cores and 1 GB of main memory. Note that each experiment runs as a single thread, which can only be processed by one core. Our method is implemented in Java with the Eclipse IDE. Our experimental platform is Weka (Waikato Environment for Knowledge Analysis) [
To integrate our method into Weka, first we add two main modules into the original Weka package “weka.attributeSelection”. One is “OrdinalGreedyStepwise”, which implements Algorithm 2, and the other is “SeedingAndHarvestSubsetSorter”, which implements Algorithm 3. These two modules change the traditional
However, this is just a beginning. In order to examine how well our method performs on given huge datasets, we must rely on the Weka Experimenter, which can automatically run comparisons among different methods under various conditions [
The attribute-selected wrapper for classifiers has already been implemented in Weka, so we can use it directly. Second, for clusterers, we want to measure the squared errors to compare their performance. Therefore, we implemented the "AdditionalMeasureProducer" interface for a number of corresponding modules. Because the details are tedious, we omit them here. In the next subsection, we introduce the datasets used in the experiments.
To rank performance clearly, we adopted 10 benchmark datasets from the UCI Machine Learning Repository [
Note that although our method, being unsupervised, does not need any label (class) information, all the datasets we adopted contain label information, because we will compare our method with several supervised feature selection methods like CFS [
To evaluate the performance of our method, we begin by comparing it with four classical feature reduction methods, namely CFS [
How do we compare the performance of feature reduction methods? As we have clarified, the main purpose of our methodology is to tackle the feature-selected clustering problem described in Definition 5. Hence, we employ a methodology comparing the squared errors and log-likelihoods of clusterers after feature reduction. The more significantly a feature reduction method can reduce the squared errors or increase the log-likelihood of a clusterer, the better its performance. Brief descriptions of these clusterers are given as follows.
Standard
Hierarchical methods [
Simple EM (expectation-maximization) methods [
First, we present the performance comparisons with classical feature reduction methods mentioned above.
From
Second, we would like to present the performance comparisons involving the five above-mentioned unsupervised feature selection methods.
With the same datasets and experimental procedures of
From
Next, let us inspect how fast our method is. In
First, let us talk about the feature reduction time. From the left part of
In this figure, we illustrate feature reduction time at two scales, where the sequence numbers of the datasets coincide with those listed in
When we inspect the total time section of
Next, we give the log-likelihood comparisons of the feature-reduced hierarchical clusterer in
Lastly, note that although the results in
In this paper, we proposed a novel two-stage framework for feature reduction/selection. The first stage is random seeding and the second is a uniformly partitioning-based harvest. Our new framework improves the traditional value-based evaluation and search schema into an order-based one, which is more effective, more efficient, and more robust. We conducted a series of experiments comparing our method with other state-of-the-art feature reduction methods on several real-life datasets. The experimental results confirm that our method is superior to traditional methods not only in accuracy but also in speed.
Essentially, our method transforms the feature reduction problem into an outlier detection problem. Because there are many state-of-the-art outlier detection methods, our framework can have plenty of variants; in this paper we only explored the uniformly partitioning-based one. The framework allows facile integration of other outlier detection methods, which we will study in the future. Moreover, we can also adopt other seeding methodologies. In practice, because of the characteristics of outlier detection problems, our framework achieves high tolerance of outliers in target datasets, which is an extraordinary feature of our framework.
Because of the simple and clear structure and level-based implementation of our method, it can be parallelized easily; we will implement and study the parallel version of our S&H algorithm in the future.
Categories of data reduction methods. The categories that our method belongs to are in boldface.
The effect of feature selection, where the only difference between the two clusters lies in the fluctuation of their horizontal means.
Scatter plots for the synthetic dataset consisting of 4 attributes:
The significance (relative importance) order of attribute subsets—{
The effect of seeding. Circles are original points and crosses are the artificially injected noise points.
An example. Bold circles are effective circles for the two clusters respectively. Those little circles are original data points, and crosses are seeded points with uniform distribution law. This dataset has been optimally clustered, and the points belonging to the leftbottom cluster have been marked by solid circles.
Recursive and uniform partitioning on attribute subset {
Recursive and uniform partitioning on attribute subset {
Recursive and uniform partitioning on attribute subset {
The seeded points have all been isolated in this 32 × 32 partitioning.
The situation when original points (solid ones) have been injected.
Schema of traditional value-based feature selection.
Schema of the novel order-based feature selection.
Relationship among the main components (shaded blocks).
The “big” structure of harvest algorithms, where “→” means “the variable is overwritten by …”.
Weka Explorer using our method.
Weka Experimenter using our method.
Feature reduction time comparisons.
Total time comparisons. In this figure, dataset sequence numbers denote ecoli, wdbc, sonar, yeast, segment, segmentation, waveform, sensor, magic, isolet sequentially.
No.  Dataset  Instances  Attributes  Classes 
1  ecoli  336  5  8 
2  wdbc  569  30  2 
3  segmentation  2310  15  7 
4  isolet  6238  617  26 
5  magic  19020  10  2 
6  segment  2310  17  7 
7  sensor  5456  23  4 
8  sonar  208  60  2 
9  waveform  5000  40  3 
10  yeast  1484  8  10 
Squared errors for feature-selected SimpleKMeans (the lower, the better).
ecoli  142.15  142.15  142.15  139.71  142.15  124.17  ●  
yeast  735.58  734.83  705.61  671.66  705.61  583.64  ●  
sonar  476.80  116.96  ●  30.09  ●  26.78  ●  36.00  ●  20.59  ● 
wdbc  212.10  76.66  ●  34.05  ●  29.31  ●  38.92  ●  4.34  ● 
segmentation  2343.31  2111.75  ●  1819.35  ●  1733.75  ●  1871.64  ●  1577.10  ● 
segment  2415.10  2118.06  ●  1790.06  ●  1653.19  ●  1800.69  ●  1509.59  ● 
waveform  5109.59  2895.61  ●  1920.29  ●  1951.29  ●  1986.36  ●  1571.62  ● 
sensor  10297.99  3470.06  ●  3015.37  ●  1636.99  ●  3634.48  ●  1815.81  ● 
magic  5552.81  1535.06  ●  1662.54  ●  3014.03  ●  2255.46  ●  3486.68  ● 
isolet  144413.40  52669.72  ●  6060.16  ●  5654.71  ●  6449.82  ●  5421.02  ● 
● statistically significant improvement
Comparisons of our method with classical feature reduction methods by squared errors of SimpleKMeans clusterer.
ecoli  ●  ●  ●  ● 
sonar  ●  
wdbc  ●  ●  ●  ● 
yeast  ●  ●  ●  ● 
segmentation  ●  ●  ●  ● 
segment  ●  ●  ●  ● 
sensor  ●  ●  ●  
waveform  ●  ●  ●  ● 
magic  ○  
isolet  ●  ●  ● 
●, ○ statistically significant improvement or degradation
Comparisons of our method with state-of-the-art unsupervised feature selection methods by squared errors of SimpleKMeans clusterer.
ecoli  ●  ●  ●  ●  ● 
sonar  ●  ●  
wdbc  ●  ●  ●  ○  ● 
yeast  ●  ●  ●  
segmentation  ●  ○  ●  ●  ● 
segment  ●  ●  ●  ● ●  ● 
sensor  ●  ●  
waveform  ●  ●  ○  ●  ○ 
magic  ○  ●  
isolet  ●  ●  ● 
●, ○ statistically significant improvement or degradation
Run time comparisons (the lower, the better).
ecoli  1.79  1.86  1.61  1.60  45.19  ○  4.67  8.80  ○  8.50  ○  9.88  53.76  ○  
wdbc  8.45  24.95  ○  17.56  ○  15.66  ○  507.98  ○  11.23  38.30  ○  25.82  ○  24.76  ○  516.19  ○ 
sonar  10.13  25.88  10.12  25.04  ○  145.69  ○  11.67  34.98  ○  12.81  29.91  ○  148.23  ○  
yeast  10.62  7.50  6.68  8.80  928.76  ○  16.28  47.58  ○  41.99  ○  45.05  ○  962.06  ○  
segmentation  21.88  61.06  ○  57.86  ○  21.65  4219.27  ○  46.72  150.10  ○  143.97  ○  133.80  ○  4292.17  ○  
segment  25.68  65.66  ○  57.81  ○  25.22  4609.78  ○  33.87  156.54  ○  125.21  ○  105.82  ○  4678.81  ○  
sensor  100.02  224.30  ○  205.02  ○  98.17  34498.79  ○  152.80  403.86  ○  336.86  ○  342.87  ○  34624.76  ○  
waveform  118.89  196.46  ○  128.53  ○  223.67  ○  49446.54  ○  150.47  619.43  ○  234.68  ○  412.53  ○  49597.85  ○ 
magic  230.87  516.89  ○  454.70  ○  248.80  191700.68  ○  711.69  754.85  755.00  872.79  192034.83  ○  
isolet  10206.08  94062.01  ○  13641.42  ○  78800.67  ○  1126765.89  ○  10237.04  101905.68  ○  13830.86  ○  86849.56  ○  1126904.47  ○ 
Our S&H is the comparison target; ○ means statistically significant degradation compared with S&H
Log-likelihood comparisons of the feature-reduced hierarchical clusterer.
segmentation  –47.18  –59.31  ●  –59.29  ●  –59.27  ●  –59.29  ● 
segment  –42.65  –55.01  ●  –55.01  ●  –55.01  ●  –55.01  ● 
ecoli  2.58  0.13  ●  0.15  ●  0.20  ●  0.15  ● 
wdbc  6.83  5.18  ●  5.18  ●  5.18  ●  5.18  ● 
yeast  8.08  6.64  ●  6.64  ●  6.57  ●  6.64  ● 
sonar  64.94  64.01  ●  64.01  ●  64.01  ●  64.01  ● 
● statistically significant degradation compared with our method
Log-likelihood comparisons of the feature-reduced simple EM clusterer.
segmentation  55.99  59.66  59.26  55.84  59.29  
segment  52.17  55.45  55.08  51.43  55.32  
ecoli  2.24  1.37  ●  1.55  ●  1.55  ●  1.56  ● 
yeast  7.35  6.89  6.89  6.89  6.89  
wdbc  8.15  5.18  ●  5.02  ●  6.30  ●  5.15  ● 
sonar  68.31  65.21  ●  65.17  ●  67.35  65.19  ● 
● statistically significant degradation compared with our method