To establish a performance benchmark, we trained the same predictive models on the original feature set without applying any dimensionality reduction. This baseline reached an accuracy of 0.9123, lower than the best-performing reduced models, PCA (0.9699) and KPCA (linear) (0.9624). These results highlight the benefit of dimensionality reduction, which not only improves computational efficiency but also enhances model generalization and clustering structure.
4.1. Experiment 1: Early Fusion
The interpretation of the score becomes relevant when comparing different dimensionality reduction methods, allowing for the determination of which one is most suitable for a specific task based on the nature of the data and the analysis objectives. Therefore, this metric plays a fundamental role in the selection of preprocessing techniques that optimize data representation and facilitate the extraction of meaningful insights [26].
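As a concrete illustration of the metric discussed above, the silhouette score can be computed with scikit-learn on clustered data. The synthetic blobs and the cluster count below are illustrative assumptions, not the study's data:

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic stand-in data: three well-separated Gaussian blobs.
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=1.0, random_state=42)
labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)

# Values near +1 indicate compact, well-separated clusters;
# values near 0 indicate overlapping clusters.
score = silhouette_score(X, labels)
print(f"silhouette score: {score:.4f}")
```

The same call applies unchanged to labels produced in a reduced space, which is how the scores reported in this section are obtained.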
In the analysis of dimensionality reduction and performance evaluation of different methods, the following results were observed:
KPCA with a linear kernel: This method achieved an accuracy of 0.9624 and a silhouette score of 0.2405, indicating a notable performance. This suggests that the use of the linear kernel was effective in capturing the linear relationships present in the data, providing a representation that favors high predictive capability.
KPCA with a polynomial kernel: This method achieved an accuracy of 0.9248 and a silhouette score of 0.3245. Although its accuracy was lower than that obtained with the linear kernel, its higher silhouette score demonstrates its ability to capture non-linear relationships between the variables. This performance highlights the utility of the polynomial kernel in contexts where the interactions between features do not follow a strictly linear pattern.
KPCA with an RBF kernel: With an accuracy of 0.8496 and a silhouette score of 0.1231, this method yielded results inferior to those achieved with the linear and polynomial kernels. While the RBF kernel is known for its ability to handle complex and non-linear relationships, in this case, its performance was limited compared to the other evaluated alternatives.
PCA (Principal Component Analysis): This method achieved the highest accuracy among all those evaluated, with a value of 0.9699 and a silhouette score of 0.2405. These results demonstrate that PCA was highly effective in capturing the most significant variance in the data, offering a representation that enhances the predictive capability of the model.
IPCA (Incremental Principal Component Analysis): This method achieved an accuracy of 0.9398 and a silhouette score of 0.2329, reflecting a notable performance. Although slightly less effective than PCA, it stands out for its ability to process large volumes of data in batches, making it a valuable option in environments with limited computational resources.
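The evaluation loop behind the results above can be sketched as follows. The dataset (scikit-learn's wine data), the classifier, and the two-component projections are illustrative assumptions rather than the study's exact configuration:

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA, IncrementalPCA, KernelPCA
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import silhouette_score
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler

# Illustrative stand-in data for the study's feature set.
X, y = load_wine(return_X_y=True)
X = StandardScaler().fit_transform(X)

reducers = {
    "PCA": PCA(n_components=2),
    "IPCA": IncrementalPCA(n_components=2),
    "KPCA (linear)": KernelPCA(n_components=2, kernel="linear"),
    "KPCA (poly)": KernelPCA(n_components=2, kernel="poly"),
    "KPCA (rbf)": KernelPCA(n_components=2, kernel="rbf"),
}

results = {}
for name, reducer in reducers.items():
    # Project, score a classifier on the projection, and measure how well
    # the true classes separate in the reduced space.
    Z = reducer.fit_transform(X)
    acc = cross_val_score(LogisticRegression(max_iter=1000), Z, y, cv=5).mean()
    sil = silhouette_score(Z, y)
    results[name] = (acc, sil)
    print(f"{name:15s} accuracy={acc:.4f} silhouette={sil:.4f}")
```

The exact numbers depend on the dataset and hyperparameters; the point of the sketch is the paired accuracy/silhouette evaluation applied uniformly to each reducer.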
Regarding the dimensionality reduction methods oriented toward visualization, t-SNE and UMAP, the results obtained reflect a lower capacity for structured separation of the classes in the projected space.
t-SNE (t-Distributed Stochastic Neighbor Embedding) models the similarities between data points by computing probabilities in the original (high-dimensional) space:

$$p_{j|i} = \frac{\exp\!\left(-\lVert x_i - x_j \rVert^2 / 2\sigma_i^2\right)}{\sum_{k \neq i} \exp\!\left(-\lVert x_i - x_k \rVert^2 / 2\sigma_i^2\right)}$$

Probabilities in the embedded (low-dimensional) space:

$$q_{ij} = \frac{\left(1 + \lVert y_i - y_j \rVert^2\right)^{-1}}{\sum_{k \neq l} \left(1 + \lVert y_k - y_l \rVert^2\right)^{-1}}$$

Cost function (KL divergence):

$$C = \mathrm{KL}(P \,\|\, Q) = \sum_{i} \sum_{j} p_{ij} \log \frac{p_{ij}}{q_{ij}}$$

UMAP (Uniform Manifold Approximation and Projection) focuses on preserving similarities in the original space, allowing for a more faithful low-dimensional representation:

$$p_{ij} = \exp\!\left(-\frac{d(x_i, x_j) - \rho_i}{\sigma_i}\right)$$

where $\rho_i$ is the distance to the nearest neighbor.

Similarities in the embedded space:

$$q_{ij} = \left(1 + a \lVert y_i - y_j \rVert^{2b}\right)^{-1}$$

where $a$ and $b$ are parameters.

Cost function (cross-entropy):

$$C = \sum_{i,j} \left[\, p_{ij} \log \frac{p_{ij}}{q_{ij}} + (1 - p_{ij}) \log \frac{1 - p_{ij}}{1 - q_{ij}} \,\right]$$
In the tests performed, t-SNE achieved a silhouette score of 0.0473, while UMAP obtained 0.0724, indicating that both techniques presented low intra-class cohesion and poor separation between groups (see Figure 8 and Figure 9).
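The projection-plus-silhouette evaluation applied to t-SNE can be sketched as follows. The dataset subset and perplexity are illustrative assumptions; UMAP (from the umap-learn package) can be substituted for `TSNE` in the same pattern:

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE
from sklearn.metrics import silhouette_score

# Illustrative stand-in data: a 500-sample subset of the digits dataset.
X, y = load_digits(return_X_y=True)
X, y = X[:500], y[:500]

# Project to 2-D with t-SNE, then measure class separation in the embedding.
Z = TSNE(n_components=2, perplexity=30, random_state=42).fit_transform(X)
sil = silhouette_score(Z, y)
print(f"t-SNE silhouette: {sil:.4f}")
```

Because t-SNE optimizes local neighborhood preservation rather than global cluster geometry, the resulting silhouette is typically lower than what linear projections achieve, consistent with the scores reported above.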
The KPCA (poly) method offers a very good compromise between classification accuracy and the quality of data separation. Although its score of 0.9248 is slightly lower than that achieved by PCA and KPCA (linear), it remains quite high and competitive. What truly stands out about this technique is its silhouette score of 0.3245, the highest among all the evaluated methods, indicating a better structuring and separation of the groups in the reduced space (see Table 4).
This combination of good accuracy and a notable ability to capture non-linear relationships makes KPCA with a polynomial kernel a particularly valuable alternative in scenarios where it is crucial to adequately represent the internal structures of the dataset to improve interpretation or enhance the performance of subsequent models (see Figure 10).
Linear and kernel-based techniques, such as PCA, IPCA, and KPCA, tend to show higher score values, which can be interpreted as better preservation of variance during dimensionality reduction or, when the score refers to the model's accuracy, as higher performance in classification tasks.
4.2. Experiment 2: Late Fusion
The first dataset corresponds to meteorological observations recorded in the pitahaya cultivation environment during the years 2022 and 2023. Each record represents a specific measurement of environmental and categorical variables oriented toward the analysis of shade conditions and light coverage. Parameters such as temperature, relative humidity, dew point, and wind speed are included. The variables “month” and “year” originate from data captured by sensors configured to record information every five minutes during this period. Likewise, the “shadow” index quantifies the perceived shade intensity, while the “group” variable, generated through clustering techniques, allows the samples to be segmented into groups with similar characteristics (see Table 5).
In the process of late data fusion, dimensionality reduction is applied independently to each dataset before proceeding with their integration. This approach simplifies the representation of each dataset, eliminating redundant or irrelevant variables that could introduce noise and hinder subsequent integration. Through methods such as Principal Component Analysis (PCA) or t-Distributed Stochastic Neighbor Embedding (t-SNE), the number of data dimensions is reduced, preserving the most significant underlying relationships. After this reduction, the datasets are fused, which optimizes the joint analysis process, improving the model’s efficiency and effectiveness. This approach contributes to minimizing computational complexity and improving the ability to identify patterns or correlations between the various sources of information.
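The late-fusion scheme described above can be sketched as follows, with synthetic stand-ins for the two datasets and PCA as the per-dataset reducer (the component counts are illustrative assumptions):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
sensor = rng.normal(size=(200, 8))       # stand-in for meteorological variables
chlorophyll = rng.normal(size=(200, 5))  # stand-in for chlorophyll measurements

def reduce(X, n_components):
    """Standardize, then project one dataset independently of the other."""
    X = StandardScaler().fit_transform(X)
    return PCA(n_components=n_components).fit_transform(X)

# Late fusion: reduce each source first, then integrate the projections.
fused = np.hstack([reduce(sensor, 3), reduce(chlorophyll, 2)])
print(fused.shape)  # (200, 5)
```

Early fusion would instead concatenate the raw feature matrices first and apply a single reducer to the combined table; the difference between the two orderings is what Experiments 1 and 2 compare.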
Dimensionality reduction methods, both linear and kernel-based, exhibit variable performance depending on the type of kernel used and the metric employed for their evaluation. In this context, the KPCA method with a polynomial kernel stands out as the most balanced option, achieving the highest score (0.9398) and the highest silhouette score (0.3534), which indicates adequate classification capability and a projection that effectively preserves the structure of the data. Both KPCA with a linear kernel and PCA show identical results in both metrics (score of 0.9323 and silhouette score of 0.1754), suggesting that the use of a linear kernel in KPCA does not offer additional advantages over a traditional linear projection like the one performed by PCA.
For its part, KPCA with an RBF kernel exhibits a considerably lower score (0.8045) and a negative silhouette score (−0.0053), which could be interpreted as poor separation between classes after reduction, possibly due to overfitting or a poor adaptation of the kernel to the dataset. As for IPCA, this method shows intermediate performance, with a score of 0.8797 and a silhouette score of 0.1408, representing a decrease in accuracy compared to PCA (see Table 6).
Finally, although t-SNE and UMAP are not oriented toward classification tasks and therefore do not have associated scores, their respective silhouette scores (0.1370 and 0.0474) indicate a limited ability to clearly separate groups in the projected space. This can be attributed to their primary focus on preserving local rather than global relationships (see Figure 11 and Figure 12).
Taken together, the results obtained reinforce the conclusion that KPCA with a polynomial kernel offers the best combination of structured projection and classification performance among the evaluated methods (see Figure 13).
The second dataset originates from an experiment conducted during the years 2022 and 2023, focused on measuring the chlorophyll content in pitahaya plants (Hylocereus spp.). Each record represents an individual observation, in which contextual and numerical variables relevant to the experimental design are integrated. Among these are the month and year of collection, as well as a description field indicating whether the pitahaya plant is grafted or ungrafted. Variables related to the applied treatment, its repetition, and the plot location are also recorded. Additionally, quantitative measures such as indexcl, which represents the chlorophyll index, and bringht are considered. Finally, the variable group, generated through unsupervised clustering techniques, allows for the classification of observations based on similarity patterns among the different records (see Table 7).
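A hedged sketch of how such a "group" variable might be derived through unsupervised clustering; the feature matrix, the choice of k-means, and the cluster count below are illustrative assumptions, not the study's procedure:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
# Stand-in for numeric columns such as the chlorophyll index.
features = rng.normal(size=(150, 3))

# Standardize, cluster, and attach the resulting labels as the "group" column.
X = StandardScaler().fit_transform(features)
group = KMeans(n_clusters=4, n_init=10, random_state=1).fit_predict(X)
print(np.unique(group))
```

Each record then carries an integer group label, allowing downstream analyses to segment observations by similarity.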
The comparative analysis reveals that the KPCA (linear), PCA, and IPCA methods achieve the best results, reaching a maximum score of 0.9549 and consistent silhouette score values around 0.225, which demonstrates their ability to effectively preserve the intrinsic structure of the data, making them robust options for clustering processes (see Table 8).
Upon examining the variants of the KPCA method, it is observed that KPCA (poly) exhibits a particular performance: although its score (0.9023) is slightly lower than that of other methods, it registers the highest silhouette score (0.2743), suggesting a better definition of clusters despite presenting a less optimal dimensional reduction. In stark contrast, KPCA (rbf) demonstrates the poorest performance of the group, with a score of 0.8496 and a notably low silhouette score (0.1117), a situation that could be attributed to overfitting problems or a suboptimal selection of the kernel parameter (see Figure 14).
In the case of non-linear techniques, t-SNE and UMAP present a significant peculiarity: while score values are not reported for these methods, their extremely low silhouette scores (0.0996 and 0.0280, respectively) highlight a fundamental limitation. Despite their recognized effectiveness for visualization tasks, these algorithms do not adequately maintain the global structures necessary for clustering (see Figure 15).
Based on these findings, it is suggested that PCA, IPCA, or KPCA (linear) be considered as preferred options for clustering applications due to their demonstrated consistency. KPCA (poly) emerges as a valuable alternative when prioritizing the balance between the score metric and the quality of cluster separation. It is important to note that t-SNE and UMAP, despite their utility in data visualization, are not suitable for clustering purposes, as their design is not oriented toward the preservation of the metrics essential for this type of analysis (see Figure 16).

The analysis of the merged sensor and chlorophyll data reveals that the linear methods KPCA (linear) and PCA achieve the best results, with a maximum score of 0.9436 and a consistent silhouette score (0.2003), demonstrating their effectiveness in integrating diverse data. IPCA, although with a slightly lower performance (score: 0.9173), remains a viable alternative thanks to its comparable silhouette score (0.1828). These results highlight the robustness of linear approaches for this type of analysis (see Table 9).
Among the non-linear methods, KPCA (poly) stands out by achieving the highest silhouette score (0.3138), making it ideal for applications where cluster separation is critical, despite its somewhat lower score (0.9211). Conversely, KPCA (rbf) exhibits the worst performance (score: 0.8271, silhouette: 0.0532), indicating its poor suitability for these data. Finally, techniques such as t-SNE and UMAP, while useful for visualization, show very low silhouette scores (0.1183 and 0.0377, respectively), confirming their limitation for clustering tasks and relegating their use mainly to exploratory purposes.
The linear methods KPCA (linear) and PCA achieve the best results in dimensionality reduction, with scores of up to 0.9699 in early fusion, consistently outperforming late fusion by an average margin of 2.1%. This advantage is maintained in cluster quality, where KPCA (poly) leads with a silhouette score of 0.3245, demonstrating that early fusion not only optimizes processing but also improves cluster definition by 12.6% compared to late fusion (see Table 10).
However, the results show significant limitations in some methods: KPCA (rbf) exhibits the worst performance among the kernel techniques, while t-SNE and UMAP are inadequate for clustering, displaying marginal silhouette scores. These findings suggest that, for the integration of sensor-chlorophyll data, the optimal combination would be PCA with early fusion for general dimensionality reduction, or KPCA (poly) when cluster separation is prioritized, discarding t-SNE/UMAP for analytical purposes due to their low effectiveness in preserving clusterable structures (see Figure 17).
4.3. Experiment 3: Friedman Test and Nemenyi Post Hoc Test
The Friedman test is applied to determine whether there is a significant difference between the evaluated methods, yielding a chi-squared statistic of χ² = 12.0000 and a p-value of 0.0174. Given that the p-value is less than 0.05, it is concluded that there are significant differences between at least two of the methods. Consequently, the performance metrics (score and silhouette score) of at least one pair of methods differ significantly. This result constitutes the first indication of variability in the performance of the analyzed methods.
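A minimal sketch of this test with SciPy, using the score values reported across the four experimental settings in this section as an illustrative input matrix (the study's exact matrix, which also includes silhouette scores, is not reproduced here; the Nemenyi post hoc step is available separately in the scikit-posthocs package as `posthoc_nemenyi_friedman`):

```python
from scipy.stats import friedmanchisquare

# Each list holds one method's score across four evaluation settings
# (early fusion, late fusion sensor, late fusion chlorophyll, merged),
# taken from the tables discussed above.
kpca_linear = [0.9624, 0.9323, 0.9549, 0.9436]
kpca_poly   = [0.9248, 0.9398, 0.9023, 0.9211]
kpca_rbf    = [0.8496, 0.8045, 0.8496, 0.8271]
pca         = [0.9699, 0.9323, 0.9549, 0.9436]
ipca        = [0.9398, 0.8797, 0.9549, 0.9173]

# Friedman test: ranks methods within each setting and tests whether
# the mean ranks differ more than chance would allow.
stat, p = friedmanchisquare(kpca_linear, kpca_poly, kpca_rbf, pca, ipca)
print(f"chi2={stat:.4f}, p={p:.4f}")
```

A p-value below 0.05 justifies proceeding to pairwise post hoc comparisons, as done in the text.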
Due to the existence of a significant difference, the Nemenyi post hoc test is applied to compare all possible pairs of methods. Firstly, the comparison between KPCA (linear) and KPCA (rbf) yields a p-value of 0.0564. Although this value lies slightly above the conventional 0.05 threshold and is therefore only marginally significant, it is close enough to the limit to suggest that the performance of KPCA (linear) and KPCA (rbf) differs in terms of the evaluated metrics.
On the other hand, the comparison between KPCA (poly) and KPCA (rbf) yields a p-value of 0.0999, clearly above 0.05, suggesting that there is no significant difference between these two methods. Similarly, the comparison between PCA and KPCA (rbf) produces a p-value of 0.0564, again pointing to a marginally significant difference in performance, consistent with what was observed between KPCA (linear) and KPCA (rbf) (see Table 11).
In contrast, several comparisons show no significant differences. Specifically, the pairs KPCA (linear) vs. PCA, KPCA (poly) vs. PCA, and PCA vs. IPCA, among others, present p-values well above the 0.05 threshold (in several cases equal to 1.0000), indicating the absence of substantial performance differences between these methods.