To establish a performance benchmark, we trained the same predictive models on the original feature set without applying any dimensionality reduction. This baseline reached an accuracy of 0.9123, lower than the best-performing reduced models, PCA (0.9699) and KPCA (linear) (0.9624). These results highlight the benefit of dimensionality reduction, which not only improves computational efficiency but also enhances model generalization and clustering structure.
4.1. Experiment 1: Early Fusion
The interpretation of the score becomes relevant when comparing different dimensionality reduction methods, allowing for the determination of which one is most suitable for a specific task based on the nature of the data and the analysis objectives. Therefore, this metric plays a fundamental role in the selection of preprocessing techniques that optimize data representation and facilitate the extraction of meaningful insights [26].
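As a concrete illustration of the metric discussed above, the silhouette score can be computed with scikit-learn on clustered data. The synthetic blobs and the cluster count below are illustrative assumptions, not the study's data:

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic stand-in data: three well-separated Gaussian blobs.
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=1.0, random_state=42)
labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)

# Values near +1 indicate compact, well-separated clusters;
# values near 0 indicate overlapping clusters.
score = silhouette_score(X, labels)
print(f"silhouette score: {score:.4f}")
```

The same call applies unchanged to labels produced in a reduced space, which is how the scores reported in this section are obtained.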
In the analysis of dimensionality reduction and performance evaluation of different methods, the following results were observed:
KPCA with a linear kernel: This method achieved an accuracy of 0.9624 and a silhouette score of 0.2405, indicating a notable performance. This suggests that the use of the linear kernel was effective in capturing the linear relationships present in the data, providing a representation that favors high predictive capability.
KPCA with a polynomial kernel: This method achieved an accuracy of 0.9248 and a silhouette score of 0.3245. Although its accuracy was lower than that obtained with the linear kernel, its higher silhouette score demonstrates its ability to capture non-linear relationships between the variables. This performance highlights the utility of the polynomial kernel in contexts where the interactions between features do not follow a strictly linear pattern.
KPCA with an RBF kernel: With an accuracy of 0.8496 and a silhouette score of 0.1231, this method yielded results inferior to those achieved with the linear and polynomial kernels. While the RBF kernel is known for its ability to handle complex and non-linear relationships, in this case, its performance was limited compared to the other evaluated alternatives.
PCA (Principal Component Analysis): This method achieved the highest accuracy among all those evaluated, with a value of 0.9699 and a silhouette score of 0.2405. These results demonstrate that PCA was highly effective in capturing the most significant variance in the data, offering a representation that enhances the predictive capability of the model.
IPCA (Incremental Principal Component Analysis): This method achieved an accuracy of 0.9398 and a silhouette score of 0.2329, reflecting a notable performance. Although slightly less effective than PCA, it stands out for its ability to process large volumes of data in batches, making it a valuable option in environments with limited computational resources.
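The evaluation loop behind the results above can be sketched as follows. The dataset (scikit-learn's wine data), the classifier, and the two-component projections are illustrative assumptions rather than the study's exact configuration:

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA, IncrementalPCA, KernelPCA
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import silhouette_score
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler

# Illustrative stand-in data for the study's feature set.
X, y = load_wine(return_X_y=True)
X = StandardScaler().fit_transform(X)

reducers = {
    "PCA": PCA(n_components=2),
    "IPCA": IncrementalPCA(n_components=2),
    "KPCA (linear)": KernelPCA(n_components=2, kernel="linear"),
    "KPCA (poly)": KernelPCA(n_components=2, kernel="poly"),
    "KPCA (rbf)": KernelPCA(n_components=2, kernel="rbf"),
}

results = {}
for name, reducer in reducers.items():
    # Project, score a classifier on the projection, and measure how well
    # the true classes separate in the reduced space.
    Z = reducer.fit_transform(X)
    acc = cross_val_score(LogisticRegression(max_iter=1000), Z, y, cv=5).mean()
    sil = silhouette_score(Z, y)
    results[name] = (acc, sil)
    print(f"{name:15s} accuracy={acc:.4f} silhouette={sil:.4f}")
```

The exact numbers depend on the dataset and hyperparameters; the point of the sketch is the paired accuracy/silhouette evaluation applied uniformly to each reducer.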
Regarding the dimensionality reduction methods oriented toward visualization, t-SNE and UMAP, the results obtained reflect a lower capacity for structured separation of the classes in the projected space.
t-SNE (t-Distributed Stochastic Neighbor Embedding) models the similarities between data points by computing probabilities in the original (high-dimensional) space:

$$p_{j|i} = \frac{\exp\!\left(-\lVert x_i - x_j \rVert^2 / 2\sigma_i^2\right)}{\sum_{k \neq i} \exp\!\left(-\lVert x_i - x_k \rVert^2 / 2\sigma_i^2\right)}$$

Probabilities in the embedded (low-dimensional) space:

$$q_{ij} = \frac{\left(1 + \lVert y_i - y_j \rVert^2\right)^{-1}}{\sum_{k \neq l} \left(1 + \lVert y_k - y_l \rVert^2\right)^{-1}}$$

Cost function (KL divergence):

$$C = \mathrm{KL}(P \,\|\, Q) = \sum_{i} \sum_{j} p_{ij} \log \frac{p_{ij}}{q_{ij}}$$

UMAP (Uniform Manifold Approximation and Projection) focuses on preserving similarities in the original space, allowing for a more faithful low-dimensional representation:

$$p_{ij} = \exp\!\left(-\frac{d(x_i, x_j) - \rho_i}{\sigma_i}\right)$$

where $\rho_i$ is the distance to the nearest neighbor.

Similarities in the embedded space:

$$q_{ij} = \left(1 + a \lVert y_i - y_j \rVert^{2b}\right)^{-1}$$

where $a$ and $b$ are parameters.

Cost function (cross-entropy):

$$C = \sum_{i,j} \left[\, p_{ij} \log \frac{p_{ij}}{q_{ij}} + (1 - p_{ij}) \log \frac{1 - p_{ij}}{1 - q_{ij}} \,\right]$$
In the tests performed, t-SNE achieved a silhouette score of 0.0473, while UMAP obtained 0.0724, indicating that both techniques presented low intra-class cohesion and poor separation between groups (see Figure 8 and Figure 9).
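The projection-plus-silhouette evaluation applied to t-SNE can be sketched as follows. The dataset subset and perplexity are illustrative assumptions; UMAP (from the umap-learn package) can be substituted for `TSNE` in the same pattern:

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE
from sklearn.metrics import silhouette_score

# Illustrative stand-in data: a 500-sample subset of the digits dataset.
X, y = load_digits(return_X_y=True)
X, y = X[:500], y[:500]

# Project to 2-D with t-SNE, then measure class separation in the embedding.
Z = TSNE(n_components=2, perplexity=30, random_state=42).fit_transform(X)
sil = silhouette_score(Z, y)
print(f"t-SNE silhouette: {sil:.4f}")
```

Because t-SNE optimizes local neighborhood preservation rather than global cluster geometry, the resulting silhouette is typically lower than what linear projections achieve, consistent with the scores reported above.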
The KPCA (poly) method offers a very good compromise between classification accuracy and the quality of data separation. Although its score of 0.9248 is slightly lower than that achieved by PCA and KPCA (linear), it remains quite high and competitive. What truly stands out about this technique is its silhouette score of 0.3245, the highest among all the evaluated methods, indicating a better structuring and separation of the groups in the reduced space (see Table 4).
This combination of good accuracy and a notable ability to capture non-linear relationships makes KPCA with a polynomial kernel a particularly valuable alternative in scenarios where it is crucial to adequately represent the internal structures of the dataset to improve interpretation or enhance the performance of subsequent models (see Figure 10).
Linear and kernel-based techniques, such as PCA, IPCA, and KPCA, tend to show higher score values, which can be interpreted as better preservation of variance during dimensionality reduction or, when the score refers to the model's accuracy, as higher performance in classification tasks.
4.2. Experiment 2: Late Fusion
The first dataset corresponds to meteorological observations recorded in the pitahaya cultivation environment during the years 2022 and 2023. Each record represents a specific measurement of environmental and categorical variables oriented toward the analysis of shade conditions and light coverage. Parameters such as temperature, relative humidity, dew point, and wind speed are included. The variables “month” and “year” originate from data captured by sensors configured to record information every five minutes during this period. Likewise, the “shadow” index quantifies the perceived shade intensity, while the “group” variable, generated through clustering techniques, allows the samples to be segmented into groups with similar characteristics (see Table 5).
In the process of late data fusion, dimensionality reduction is applied independently to each dataset before proceeding with their integration. This approach simplifies the representation of each dataset, eliminating redundant or irrelevant variables that could introduce noise and hinder subsequent integration. Through methods such as Principal Component Analysis (PCA) or t-Distributed Stochastic Neighbor Embedding (t-SNE), the number of data dimensions is reduced, preserving the most significant underlying relationships. After this reduction, the datasets are fused, which optimizes the joint analysis process, improving the model’s efficiency and effectiveness. This approach contributes to minimizing computational complexity and improving the ability to identify patterns or correlations between the various sources of information.
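The late-fusion scheme described above can be sketched as follows, with synthetic stand-ins for the two datasets and PCA as the per-dataset reducer (the component counts are illustrative assumptions):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
sensor = rng.normal(size=(200, 8))       # stand-in for meteorological variables
chlorophyll = rng.normal(size=(200, 5))  # stand-in for chlorophyll measurements

def reduce(X, n_components):
    """Standardize, then project one dataset independently of the other."""
    X = StandardScaler().fit_transform(X)
    return PCA(n_components=n_components).fit_transform(X)

# Late fusion: reduce each source first, then integrate the projections.
fused = np.hstack([reduce(sensor, 3), reduce(chlorophyll, 2)])
print(fused.shape)  # (200, 5)
```

Early fusion would instead concatenate the raw feature matrices first and apply a single reducer to the combined table; the difference between the two orderings is what Experiments 1 and 2 compare.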
Dimensionality reduction methods, both linear and kernel-based, exhibit variable performance depending on the type of kernel used and the metric employed for their evaluation. In this context, the KPCA method with a polynomial kernel stands out as the most balanced option, achieving the highest score (0.9398) and the highest silhouette score (0.3534), which indicates adequate classification capability and a projection that effectively preserves the structure of the data. Both KPCA with a linear kernel and PCA show identical results in both metrics (score of 0.9323 and silhouette score of 0.1754), suggesting that the use of a linear kernel in KPCA does not offer additional advantages over a traditional linear projection like the one performed by PCA.
For its part, KPCA with an RBF kernel exhibits a considerably lower score (0.8045) and a negative silhouette score (−0.0053), which could be interpreted as poor separation between classes after reduction, possibly due to overfitting or a poor adaptation of the kernel to the dataset. As for IPCA, this method shows intermediate performance, with a score of 0.8797 and a silhouette score of 0.1408, representing a decrease in accuracy compared to PCA (see Table 6).
Finally, although t-SNE and UMAP are not oriented toward classification tasks and therefore do not have associated scores, their respective silhouette scores (0.1370 and 0.0474) indicate a limited ability to clearly separate groups in the projected space. This can be attributed to their primary focus on preserving local rather than global relationships (see Figure 11 and Figure 12).
Taken together, the results obtained reinforce the conclusion that KPCA with a polynomial kernel offers the best combination of structured projection and classification performance among the evaluated methods (see Figure 13).
The second dataset originates from an experiment conducted during the years 2022 and 2023, focused on measuring the chlorophyll content in pitahaya plants (Hylocereus spp.). Each record represents an individual observation, in which contextual and numerical variables relevant to the experimental design are integrated. Among these are the month and year of collection, as well as a description field indicating whether the pitahaya plant is grafted or ungrafted. Variables related to the applied treatment, its repetition, and the plot location are also recorded. Additionally, quantitative measures such as indexcl, which represents the chlorophyll index, and bringht are considered. Finally, the variable group, generated through unsupervised clustering techniques, allows for the classification of observations based on similarity patterns among the different records (see Table 7).
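A hedged sketch of how such a "group" variable might be derived through unsupervised clustering; the feature matrix, the choice of k-means, and the cluster count below are illustrative assumptions, not the study's procedure:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
# Stand-in for numeric columns such as the chlorophyll index.
features = rng.normal(size=(150, 3))

# Standardize, cluster, and attach the resulting labels as the "group" column.
X = StandardScaler().fit_transform(features)
group = KMeans(n_clusters=4, n_init=10, random_state=1).fit_predict(X)
print(np.unique(group))
```

Each record then carries an integer group label, allowing downstream analyses to segment observations by similarity.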
The comparative analysis reveals that the KPCA (linear), PCA, and IPCA methods achieve the best results, reaching a maximum score of 0.9549 and consistent silhouette score values around 0.225, which demonstrates their ability to effectively preserve the intrinsic structure of the data, making them robust options for clustering processes (see Table 8).
Upon examining the variants of the KPCA method, it is observed that KPCA (poly) exhibits a particular performance: although its score (0.9023) is slightly lower than that of other methods, it registers the highest silhouette score (0.2743), suggesting a better definition of clusters despite presenting a less optimal dimensional reduction. In stark contrast, KPCA (rbf) demonstrates the poorest performance of the group, with a score of 0.8496 and a notably low silhouette score (0.1117), a situation that could be attributed to overfitting problems or a suboptimal selection of the kernel parameter (see Figure 14).
In the case of non-linear techniques, t-SNE and UMAP present a significant peculiarity: while score values are not reported for these methods, their extremely low silhouette scores (0.0996 and 0.0280, respectively) highlight a fundamental limitation. Despite their recognized effectiveness for visualization tasks, these algorithms do not adequately maintain the global structures necessary for clustering (see Figure 15).
Based on these findings, it is suggested that PCA, IPCA, or KPCA (linear) be considered as preferred options for clustering applications due to their demonstrated consistency. KPCA (poly) emerges as a valuable alternative when prioritizing the balance between the score metric and the quality of cluster separation. It is important to note that t-SNE and UMAP, despite their utility in data visualization, are not suitable for clustering purposes, as their design is not oriented toward the preservation of the metrics essential for this type of analysis (see Figure 16).

The analysis of the merged sensor and chlorophyll data reveals that the linear methods KPCA (linear) and PCA achieve the best results, with a maximum score of 0.9436 and a consistent silhouette score (0.2003), demonstrating their effectiveness in integrating diverse data. IPCA, although with a slightly lower performance (score: 0.9173), remains a viable alternative thanks to its comparable silhouette score (0.1828). These results highlight the robustness of linear approaches for this type of analysis (see Table 9).
Among the non-linear methods, KPCA (poly) stands out by achieving the highest silhouette score (0.3138), making it ideal for applications where cluster separation is critical, despite its somewhat lower score (0.9211). Conversely, KPCA (rbf) exhibits the worst performance (score: 0.8271, silhouette: 0.0532), indicating its poor suitability for these data. Finally, techniques such as t-SNE and UMAP, while useful for visualization, show very low silhouette scores (0.1183 and 0.0377, respectively), confirming their limitation for clustering tasks and relegating their use mainly to exploratory purposes.
The linear methods KPCA (linear) and PCA achieve the best results in dimensionality reduction, with scores of up to 0.9699 in early fusion, consistently outperforming late fusion by an average margin of 2.1%. This advantage is maintained in cluster quality, where KPCA (poly) leads with a silhouette score of 0.3245, demonstrating that early fusion not only optimizes processing but also improves cluster definition by 12.6% compared to late fusion (see Table 10).
However, the results show significant limitations in some methods: KPCA (rbf) exhibits the worst performance among the kernel techniques, while t-SNE and UMAP are inadequate for clustering, displaying marginal silhouette scores. These findings suggest that, for the integration of sensor-chlorophyll data, the optimal combination would be PCA with early fusion for general dimensionality reduction, or KPCA (poly) when cluster separation is prioritized, discarding t-SNE/UMAP for analytical purposes due to their low effectiveness in preserving clusterable structures (see Figure 17).
4.3. Experiment 3: Friedman Test and Nemenyi Post Hoc Test
The Friedman test is applied to determine whether there is a significant difference between the evaluated methods, yielding a chi-squared statistic of χ² = 12.0000 and a p-value of 0.0174. Given that the p-value is less than 0.05, it is concluded that there are significant differences between at least two of the methods. Consequently, the performance metrics (score and silhouette score) of at least one pair of methods differ significantly. This result constitutes the first indication of variability in the performance of the analyzed methods.
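A minimal sketch of this test with SciPy, using the score values reported across the four experimental settings in this section as an illustrative input matrix (the study's exact matrix, which also includes silhouette scores, is not reproduced here; the Nemenyi post hoc step is available separately in the scikit-posthocs package as `posthoc_nemenyi_friedman`):

```python
from scipy.stats import friedmanchisquare

# Each list holds one method's score across four evaluation settings
# (early fusion, late fusion sensor, late fusion chlorophyll, merged),
# taken from the tables discussed above.
kpca_linear = [0.9624, 0.9323, 0.9549, 0.9436]
kpca_poly   = [0.9248, 0.9398, 0.9023, 0.9211]
kpca_rbf    = [0.8496, 0.8045, 0.8496, 0.8271]
pca         = [0.9699, 0.9323, 0.9549, 0.9436]
ipca        = [0.9398, 0.8797, 0.9549, 0.9173]

# Friedman test: ranks methods within each setting and tests whether
# the mean ranks differ more than chance would allow.
stat, p = friedmanchisquare(kpca_linear, kpca_poly, kpca_rbf, pca, ipca)
print(f"chi2={stat:.4f}, p={p:.4f}")
```

A p-value below 0.05 justifies proceeding to pairwise post hoc comparisons, as done in the text.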
Due to the existence of a significant difference, the Nemenyi post hoc test is applied to compare all possible pairs of methods. Firstly, the comparison between KPCA (linear) and KPCA (rbf) yields a p-value of 0.0564. Although this value lies slightly above the conventional 0.05 threshold and is therefore only marginally significant, it is close enough to the limit to suggest that the performance of KPCA (linear) and KPCA (rbf) differs in terms of the evaluated metrics.
On the other hand, the comparison between KPCA (poly) and KPCA (rbf) yields a p-value of 0.0999, clearly above 0.05, suggesting that there is no significant difference between these two methods. Similarly, the comparison between PCA and KPCA (rbf) produces a p-value of 0.0564, again pointing to a marginally significant difference in performance, consistent with what was observed between KPCA (linear) and KPCA (rbf) (see Table 11).
In contrast, several comparisons show no significant differences. Specifically, the pairs KPCA (linear) vs. PCA, KPCA (poly) vs. PCA, and PCA vs. IPCA, among others, present p-values well above the 0.05 threshold (in several cases equal to 1.0000), indicating the absence of substantial performance differences between these methods.