3.4. Scenarios Using Different Parameters
This section describes five scenarios (test cases) that use different parameter values to exercise the novel method. The parameters k-min = 3 and k-max = 10 are common to all scenarios: they bound the number of possible clusters, but they play no role in the decision made by the algorithm. The following scenarios show how the remaining parameters affect that decision.
Case 1: Interpretability-oriented and high resolution of data
For this first scenario, let’s assume that it is more important to keep the feature names (interpretability) than to optimize feature integrity, and that a good feature resolution is needed. For values, interpretability-oriented = 0.9, integrity-oriented = 0.1 and target-resolution = 85%.
Table 5 shows the results using this configuration.
In this table, we can see that the best feature selection (FS) silhouette index (0.3905) is greater than the best feature extraction (FE) silhouette index (0.3530). Note that the consistency of the clustering process is unrelated to the resolution of the data. Better consistency often comes with lower dimensionality: it can be very hard to obtain good consistency with a high number of features. That is why, with this method, orienting a process toward integrity (by using feature extraction instead of feature selection) does not result in better consistency while clustering. Often, lowering the resolution yields an SI that shows better consistency.
In this scenario, the parameter interpretability-oriented (0.9) has a higher value than the integrity-oriented value (0.1). When Equations (7) and (8) are applied, the interpretability score (0.3514) is higher than the integrity score (0.0353). To reach 85% of the resolution, we must use the best seven features; these features have a resolution of 88.3%. In the clustering process, the optimal number of clusters is three.
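A minimal sketch of the decision rule implied by Equations (7) and (8): the interpretability score weights the best FS silhouette index by the interpretability-oriented parameter, and the integrity score weights the best FE silhouette index by the integrity-oriented parameter. The function and variable names below are illustrative, not from the paper.

```python
# Sketch of the decision between feature selection (FS) and feature
# extraction (FE), consistent with the scores reported in the scenarios.

def choose_strategy(best_fs_si, best_fe_si, interpretability_oriented, integrity_oriented):
    interpretability_score = interpretability_oriented * best_fs_si  # Eq. (7)
    integrity_score = integrity_oriented * best_fe_si                # Eq. (8)
    strategy = ("feature selection" if interpretability_score > integrity_score
                else "feature extraction")
    return interpretability_score, interpretability_score and integrity_score, strategy

# Case 1 values: best FS SI = 0.3905, best FE SI = 0.3530, weights 0.9 / 0.1.
interp, integ, strat = choose_strategy(0.3905, 0.3530, 0.9, 0.1)
print(interp, integ, strat)  # scores close to 0.3514 and 0.0353, FS chosen
```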
Figure 4 shows the distribution of each element according to its silhouette index.
This figure shows the three clusters in three different colors. The thicker the horizontal band, the more data the cluster contains; the longer the bars, the more consistent the data is with its cluster. The graphic shows very few misplaced values (negative values) and an average SI of 0.3905 (red dotted line).
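The silhouette index used throughout these scenarios can be computed directly from its definition: s(i) = (b(i) − a(i)) / max(a(i), b(i)), where a(i) is the mean distance from point i to its own cluster and b(i) the mean distance to the nearest other cluster. The sketch below uses a toy 1-D dataset, not the paper's data.

```python
# Minimal silhouette-index sketch on a toy 1-D dataset (illustrative only).

def silhouette(points, labels):
    n = len(points)
    clusters = {c: [i for i in range(n) if labels[i] == c] for c in set(labels)}
    scores = []
    for i in range(n):
        own = [j for j in clusters[labels[i]] if j != i]
        a = sum(abs(points[i] - points[j]) for j in own) / len(own)   # intra-cluster
        b = min(sum(abs(points[i] - points[j]) for j in clusters[c]) / len(clusters[c])
                for c in clusters if c != labels[i])                  # nearest other cluster
        scores.append((b - a) / max(a, b))
    return scores

pts = [1.0, 1.2, 1.1, 5.0, 5.3, 4.9]
labs = [0, 0, 0, 1, 1, 1]
scores = silhouette(pts, labs)
print(sum(scores) / len(scores))  # average SI near 1: tight, well-separated clusters
```

Negative values would indicate misplaced points, exactly as read off the figure.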
Figure 5 shows cluster A using a stacked radar graphic, which makes it easy to visualize the consistency of the normalized values. It also shows that the feature names have been kept, the most important criterion (interpretability) in this case, since the interpretability-oriented parameter is equal to 0.9.
Case 2: Integrity-oriented and high resolution of data
In this second scenario, integrity is more important than keeping the signification of the features (interpretability). A good feature resolution is also needed. For values, interpretability-oriented = 0.1, integrity-oriented = 0.9 and target-resolution = 85%.
Table 6 shows the results for this configuration.
We can observe that the value of the best FE silhouette index (0.3530) is lower than the value of the best FS silhouette index (0.3905). Since the integrity-oriented parameter has a high value (0.9), the integrity score (0.3177) is higher than the interpretability score (0.0390), and the feature extraction strategy is selected. Seven PCs are required to reach 85% of the resolution. The optimal number of clusters is three.
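The "number of PCs needed to reach a target resolution" can be read off the cumulative explained variance of a PCA. The sketch below illustrates this with synthetic correlated data standing in for the paper's dataset; the function name and threshold are illustrative.

```python
import numpy as np

# Find the smallest number of principal components whose cumulative
# explained variance reaches a target "resolution" (e.g., 85%).

def n_components_for_resolution(X, target=0.85):
    Xc = X - X.mean(axis=0)                     # center the features
    cov = np.cov(Xc, rowvar=False)
    eigvals = np.linalg.eigvalsh(cov)[::-1]     # component variances, largest first
    ratio = np.cumsum(eigvals) / eigvals.sum()  # cumulative explained variance
    return int(np.searchsorted(ratio, target) + 1)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8)) @ rng.normal(size=(8, 8))  # 8 correlated features
k = n_components_for_resolution(X, 0.85)
print(k)  # number of PCs needed for at least 85% of the variance
```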
Figure 6 shows the distribution of the SI for this case.
This figure displays the three clusters. There are a few misplaced values (between −1 and 0). The average SI is 0.353. Even though the clustering consistency is lower, the integrity of the data is better, since every feature has been used to downsize to the seven PCs and the loss is kept to a minimum. Remember that the SI often improves when dimensionality is reduced; for instance, keeping only two PCs or features tends to provide the best SI results.
Figure 7 shows cluster A using a stacked radar graphic.
Keep in mind that when feature extraction is performed, all of the feature labels are lost. In this particular case, for instance, it becomes impossible to refer to feature 5, “Education Score”, since this value, like all the others, has been combined to generate the new features called principal components (PCs). Original features can no longer be addressed, which may be an important drawback depending on what has to be done next. For instance, in a clustering graph such as Figure 7, the axes are labeled “PC1”, “PC2”, “PC3”, and so on. Having fewer dimensions is an advantage; losing the identity of the features is a disadvantage and the opposite of “interpretability”.
Case 3: Equally integrity- and interpretability-oriented and high resolution of data
For this third scenario, we assume that keeping feature signification is as important as optimizing the integrity of the features, and that a good feature resolution is needed. For values, interpretability-oriented = 0.5, integrity-oriented = 0.5 and target-resolution = 85%.
Table 7 shows the results for this configuration.
The best FS silhouette index (0.3905) is greater than the best FE silhouette index (0.3530). After applying Equations (7) and (8) with the interpretability-oriented and integrity-oriented parameters, the interpretability score (0.1952) is higher than the integrity score (0.1765), so the selection process is used.
If interpretability and integrity are equally important, the nature of the data determines which process generates the better SI (better clustering consistency), since the two silhouette indexes enter Equations (7) and (8).
To reach 85% of the resolution, we must use seven features, which have a resolution of 88.3%. In the clustering process, the optimal number of clusters is three. The SI figure and the stacked radar graphic are the same as in scenario 1 (Figure 4 and Figure 5).
Case 4: Interpretability-oriented and low resolution of data
This case is oriented toward interpretability. Compared to case 1, the resolution value has been lowered. For values, interpretability-oriented = 0.9, integrity-oriented = 0.1 and target-resolution = 50%. The results are shown in Table 8.
The value of the best FS silhouette index (0.4393) is greater than the value of the best FE silhouette index (0.3775). As in case 1, the chosen method is feature selection, because the interpretability-oriented parameter (0.9) is higher than the integrity-oriented value (0.1) and the interpretability score (0.3953) is higher than the integrity score (0.0377). To reach 50% of the resolution, we must use the best four features; those features have a resolution of 52.3%. In the clustering process, the optimal number of clusters is three.
Figure 8 shows the distribution of each element according to its silhouette index.
This figure shows the three different clusters. The graphic shows no misplaced values (negative values) and also shows an average of 0.4393 (red dotted line).
Figure 9 shows the representation of the clustering using a stacked radar graphic.
The feature names have been kept, as this scenario is oriented toward interpretability. Compared to case 1, which has a high resolution of data (seven features), this graphic shows only four features, since the resolution value has been lowered to 50%. At a glance, we can see a good consistency, even better than in case 1 (SI = 0.4393 for case 4 versus SI = 0.3905 for case 1). Recall that better consistency is often linked to fewer dimensions in the data.
Case 5: Integrity-oriented and low resolution of data
For this last scenario, integrity is more important than keeping feature signification, but a lower feature resolution than in case 2 is defined. For values, interpretability-oriented = 0.1, integrity-oriented = 0.9 and target-resolution = 50%.
Table 9 shows the results using this configuration.
The best FS silhouette index (0.4393) is greater than the best FE silhouette index (0.3775). However, the interpretability score (0.0439) is low and the integrity score (0.3397) is high, so the feature extraction process is selected. Four PCs are required to reach 50% of the resolution. The optimal number of clusters is three.
Figure 10 shows the distribution of the SI for this case.
This figure shows the three clusters and one misplaced value (located between −1 and 0). The average SI is 0.3775, noticeably better than in case 2 (0.353), which has more dimensions.
Figure 11 presents a stacked radar graphic of cluster A.
The consistency is quite good. However, as with every cluster produced by the feature extraction process, the feature names are lost and replaced by PCs, resulting in reduced interpretability.
3.5. Method Validation
The last part of the analysis is the validation of the method. This paper presents a novel approach with no comparable published methods; it has no pretension of improving PCA or FRSD. It uses these algorithms but has different inputs, outputs, and parameters. The contribution of the proposed approach is that it makes correct decisions about the dimensionality reduction method and the number of features/PCs to keep. This novel method is a complete decision process that includes the evaluation of feature importance, a decision based on parameters and data profiles, clustering, and the presentation of clusters. There is no known or documented method against which it can be compared.
That being said, it is crucial to validate the algorithm. To ensure that it makes the correct decisions, 250 realistic random cases have been generated. Each random case includes a random SI (after a hypothetical feature selection), a random SI (after a hypothetical feature extraction), and a random interpretability importance parameter. An integrity importance parameter has also been computed as 1 − the interpretability importance parameter. Using these data, the decision algorithm is applied: for each case, an interpretability score and an integrity score are calculated, and a decision is made between feature selection and feature extraction.
Figure 12 shows the classification of the points according to the interpretability scores and the integrity scores. The red points use a feature extraction process and the blue points use a feature selection process. The black line divides the interpretability (feature selection) and the integrity (feature extraction) domains.
The points cannot have a high value on both axes because the interpretability importance parameter is the complement of the integrity importance parameter (alpha and 1 − alpha). These are used in Equations (7) and (8), which define the axes. If one value is very high, the other must be very low; both can have an average value. This graphic shows that the algorithm always makes a good decision, even when a human may have difficulty choosing. Since the algorithm uses a threshold, the classification is always correct; hence, 250 points are sufficient to show the distribution of the results. The graphic is simple, but it validates the results of the complex previous parts of the process that use FRSD and PCA.
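The validation above can be sketched in a few lines: generate random cases, score each one with Equations (7) and (8), and classify it against the threshold. The case generation and names below are illustrative; the paper's actual 250 cases are not reproduced.

```python
import random

# Generate random validation cases: a hypothetical FS silhouette, a
# hypothetical FE silhouette, and an interpretability weight alpha
# (the integrity weight is 1 - alpha).
random.seed(42)
decisions = []
for _ in range(250):
    si_fs, si_fe, alpha = random.random(), random.random(), random.random()
    interpretability_score = alpha * si_fs        # Eq. (7)
    integrity_score = (1 - alpha) * si_fe         # Eq. (8)
    decisions.append("FS" if interpretability_score > integrity_score else "FE")

print(len(decisions), sorted(set(decisions)))  # both strategies should appear
```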
Figure 13 shows the bar pairs of the number of features (blue) and the principal components (red), according to the target resolution of data (as specified in the parameters). As in the previously described scenario test cases, the London dataset has been used.
As expected, we can see that a perfect resolution of 100% requires all eight available features, and this number slowly declines with each 10% step. To validate the integrity advantage of feature extraction over feature selection, we subtract their respective resolutions; they can be compared only when the number of features equals the number of principal components.
Figure 14 displays the resolution difference, in percent, for all target resolutions that have the same number of features/PCs. For instance, reading Figure 13, we can see that resolutions of 20%, 30%, 50%, 60%, 70%, 80% and 100% have the same number of features/PCs; these are the points at which the values of Figure 14 are defined. Blue bars represent the resolution percentage differences between feature extraction and feature selection.
We can see that there is a resolution advantage when using feature extraction. This validates the integrity-oriented parameter.
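The subtraction behind this comparison is simple and can be sketched with hypothetical percentages (not the London dataset's actual values): at each component count k, the cumulative resolution of the first k PCs is compared with that of the top-k features. FE can never do worse here, since PCs maximize the variance retained for a given count.

```python
# Hypothetical cumulative resolutions for k = 1..8 (illustrative numbers only).
fs_resolution = [22, 40, 52, 62, 71, 79, 88, 100]   # top-k selected features
fe_resolution = [30, 48, 60, 70, 78, 85, 92, 100]   # first k principal components

# Per-k resolution advantage of feature extraction over feature selection.
diff = [fe - fs for fe, fs in zip(fe_resolution, fs_resolution)]
print(diff)  # [8, 8, 8, 8, 7, 6, 4, 0]
```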
As for the interpretability-oriented parameter, the best way to validate it is simply to compare graphs after a feature selection and after a feature extraction. Comparing Figure 5 to Figure 7, it is easier to interpret real feature names (as in Figure 5) than abstract principal components (PC1, PC2, ...) (as in Figure 7). This validates the interpretability-oriented parameter.