3.1. Electrical Control Signal and Electrical Signal Contaminated with Lead
A total of 320 electrical signals were recorded from 160 plants of
Aloe vera var.
chinensis. The control signals, which were consistent with previous reports [
31,
40], showed clear differences from the signals of lead-exposed plants. These differences in dynamic behavior were statistically significant across all specimens, as illustrated in
Figure 5.
The results demonstrate consistent and reproducible differences in signal morphology between control and lead-exposed plants. This behavior can be explained by the disruptive effect of heavy metals like lead on physiological processes, particularly those involving ion transport, cellular excitability, and membrane potential regulation [
8,
14]. Lead ions are known to interfere with calcium signaling and potassium channel activities, which are critical in the generation and propagation of bioelectrical signals in plant tissues [
22,
41].
Consistent with these physiological disruptions, the contaminated signals exhibited a noticeable reduction in amplitude and temporal variability across all recordings. Such a reduction aligns with findings in other biosignal studies where metal-induced oxidative stress and membrane depolarization dampen the bioelectrical activity of plant systems [
13,
27]. Importantly, the graphical contrast between the two signal classes indicates that lead contamination produces a systematic and measurable alteration in the signals, rather than random noise or recording artifacts [
42].
This consistent visual difference in signal morphology supports the view that bioelectrical signals can act as reliable indicators of environmental stress, particularly under trace metal contamination [
21]. As shown in
Figure 6, box plots and density plots provide a visual comparison of the two signal classes, illustrating the differences observed in their distributions.
The control group exhibited an average amplitude of 201.95 mV (SD ± 9.84), indicating consistent signal levels with low variability across plants. In contrast, the lead-contaminated signals averaged 150.52 mV (SD ± 8.91), a substantial decrease of approximately 25.5%. These results are consistent with previously reported trends in bioelectrical signal suppression due to metal stress [
10,
11].
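As a quick check of the reported attenuation, the relative decrease follows directly from the two group means quoted above:

```latex
\[
\frac{201.95 - 150.52}{201.95} \times 100\%
  = \frac{51.43}{201.95} \times 100\%
  \approx 25.5\%
\]
```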
From a mechanistic standpoint, this attenuation may arise from lead-induced inhibition of H⁺-ATPase activity and impaired ionic balance across membranes, which in turn reduces the membrane potential fluctuations necessary to generate measurable signals [
14]. This result, observed consistently across the sample of plants, supports the potential of
Aloe vera as a real-time biosensor for detecting lead in different environments.
Furthermore, the reproducibility of these measurements, obtained with a low-cost Arduino-based monitoring system, demonstrates the practicality of biosensing with plant electrical signals for environmental diagnostics [
21,
22].
3.2. Signal Preprocessing
To assess the quality and characteristics of the electrical signal captured from
Aloe vera, several statistical and dynamical tests were applied to validate its suitability for further analysis and classification. Specifically, the randomness, non-stationarity, nonlinearity, and long-range correlations were examined, as these are typical properties of bioelectrical signals in plant physiology [
22].
The first test assessed the randomness of the signal using autocorrelation analysis. The autocorrelation function was computed for a range of lags using Equation (2). In the implementation, a Python-based routine calculated the lagged products of the signal and normalized their sum over the total effective length.
To visualize this analysis, a stem plot was used (
Figure 7), which is suitable for discrete data and highlights how signal correlations decay across time lags.
Figure 7 shows that the autocorrelation reaches its maximum at zero lag, while the values at subsequent lags quickly decay toward zero. This indicates that the electrical signal does not follow a repetitive pattern, which is characteristic of a random process and a known property of biological and plant electrophysiological signals [
10,
14].
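The autocorrelation step can be sketched as follows. This is a minimal illustration rather than the routine used in the study: the mean removal, the division by the effective length n - k, the normalization by the zero-lag value, and the synthetic white-noise input are all assumptions, since Equation (2) is not reproduced here.

```python
import numpy as np

def autocorrelation(x, max_lag):
    """Sample autocorrelation r[k] for k = 0..max_lag, scaled so r[0] = 1."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()                      # remove the mean before correlating
    n = len(x)
    r = np.empty(max_lag + 1)
    for k in range(max_lag + 1):
        # sum of lagged products divided by the effective length n - k
        r[k] = np.dot(x[: n - k], x[k:]) / (n - k)
    return r / r[0]

rng = np.random.default_rng(0)
noise = rng.standard_normal(2000)         # synthetic stand-in, not recorded data
r = autocorrelation(noise, max_lag=20)
```

For an uncorrelated input such as this one, all values after lag zero stay close to zero, which is the pattern described for Figure 7.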
Next, the Augmented Dickey–Fuller (ADF) test (Equation (
3)) was applied to assess whether the signal was stationary or exhibited time-dependent trends. The results yielded a test statistic of
and a
p-value of
. Since
and
was greater than the critical values at the 1%, 5%, and 10% levels, the unit-root null hypothesis could not be rejected. This indicates that the signal is non-stationary, consistent with the time-varying characteristics commonly observed in physiological processes of plants under external stress [
8,
21].
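For readers who wish to reproduce this kind of test, the sketch below computes an ADF-type statistic by ordinary least squares. It is an illustration under stated assumptions, not the implementation used in the study: the fixed lag order (4), the constant-only regression, and the asymptotic critical values are choices made here, and the input series are synthetic.

```python
import numpy as np

# Asymptotic Dickey-Fuller critical values for a regression with a constant
# (finite-sample values differ slightly; treated here as approximations)
CRITICAL_VALUES = {"1%": -3.43, "5%": -2.86, "10%": -2.57}

def adf_statistic(y, lags=4):
    """t-statistic of gamma in dy[t] = a + gamma*y[t-1] + sum_i d_i*dy[t-i] + e[t]."""
    y = np.asarray(y, dtype=float)
    dy = np.diff(y)
    target = dy[lags:]                                  # differences to explain
    cols = [np.ones(len(target)), y[lags:-1]]           # constant and lagged level
    for i in range(1, lags + 1):
        cols.append(dy[lags - i : len(dy) - i])         # lagged differences
    X = np.column_stack(cols)
    beta, *_ = np.linalg.lstsq(X, target, rcond=None)
    resid = target - X @ beta
    sigma2 = resid @ resid / (len(target) - X.shape[1])
    cov = sigma2 * np.linalg.inv(X.T @ X)
    return beta[1] / np.sqrt(cov[1, 1])

rng = np.random.default_rng(1)
white = rng.standard_normal(1000)                # stationary by construction
walk = np.cumsum(rng.standard_normal(1000))      # unit-root process by construction
stat_white = adf_statistic(white)
stat_walk = adf_statistic(walk)
```

The unit-root null is rejected only when the statistic falls below the critical value, so a statistic above all three thresholds, as reported for the recorded signal, leaves the null in place.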
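Nonlinearity was then examined with the Brock–Dechert–Scheinkman (BDS) test. The analysis yielded a BDS statistic of , with a z-score of and a standard deviation of . Since , the null hypothesis of independent and identically distributed observations was rejected, indicating nonlinear dynamics in the signal.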
This aligns with other studies showing that plant electrophysiological signals typically display nonlinear dynamics due to the complex interplay of ion transport, membrane polarization, and environmental feedback mechanisms [
11,
27].
To evaluate long-range temporal correlations, Detrended Fluctuation Analysis (DFA) was applied. This method quantifies how signal fluctuations scale with window size, providing an estimate of persistence across time.
For control signals, the DFA exponent was
, consistent with persistent correlations and stable temporal organization. In contrast, Pb-contaminated signals showed a lower exponent of
, reflecting disrupted structure and anti-persistent dynamics (
Figure 8). Similar alterations have been reported in plants under metal stress, where reduced DFA exponents are associated with impaired physiological regulation [
13,
41].
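The DFA procedure described above can be sketched as follows. This is a minimal version assuming first-order (linear) detrending, non-overlapping windows, and an arbitrary set of window sizes; none of these settings are taken from the study, and the inputs are synthetic.

```python
import numpy as np

def dfa_exponent(x, scales):
    """Estimate the DFA scaling exponent with linear (order-1) detrending."""
    x = np.asarray(x, dtype=float)
    profile = np.cumsum(x - x.mean())             # integrated, mean-removed signal
    fluct = []
    for s in scales:
        n_win = len(profile) // s
        windows = profile[: n_win * s].reshape(n_win, s)
        t = np.arange(s)
        ms = []
        for w in windows:
            coef = np.polyfit(t, w, 1)            # local linear trend
            ms.append(np.mean((w - np.polyval(coef, t)) ** 2))
        fluct.append(np.sqrt(np.mean(ms)))        # RMS fluctuation at scale s
    # slope of log F(s) versus log s is the scaling exponent
    slope, _ = np.polyfit(np.log(scales), np.log(fluct), 1)
    return slope

rng = np.random.default_rng(2)
white = rng.standard_normal(4096)
scales = np.array([16, 32, 64, 128, 256])
alpha_white = dfa_exponent(white, scales)             # uncorrelated noise
alpha_walk = dfa_exponent(np.cumsum(white), scales)   # integrated noise
```

As a reading aid, an exponent near 0.5 corresponds to uncorrelated fluctuations, values above 0.5 to persistent correlations, and values below 0.5 to anti-persistent behavior.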
Overall, the signal preprocessing results confirmed that the acquired electrical signals exhibit the expected characteristics described in the literature for biosignals in plants—namely randomness, non-stationarity, nonlinearity, and distinct scaling behaviors—supporting their suitability for advanced time-series analysis and classification.
3.6. Parametric Analysis
To model the temporal dynamics of the lead-contaminated
Aloe vera electrical signal, an autoregressive model of order 10 (AR(10)) was implemented. This model estimates the current value of the signal
based on a weighted combination of its 10 previous samples, as expressed in Equation (
12).
The model was trained on the contaminated signal using least-squares estimation.
Figure 11 compares the original signal and the AR(10)-based reconstruction. The AR model captures the general trend and amplitude envelope with high fidelity, including the transient fluctuations around sample index 500, where the signal exhibits strong perturbations likely caused by lead-induced physiological stress.
Quantitatively, the AR(10) model achieved a mean squared error of , a mean absolute error of , and a coefficient of determination of . These values indicate that the model effectively captures the underlying linear structure of the signal, despite the non-stationary and noisy characteristics of biological data. The low error values confirm its predictive ability, while a coefficient of determination close to unity reflects a close fit.
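The least-squares AR fit and the three reported metrics can be sketched as below. This is an illustrative version only: the function names are hypothetical, the test signal is a synthetic sinusoid rather than a recorded trace, and the original fitting code may differ in detail.

```python
import numpy as np

def fit_ar(x, order=10):
    """Least-squares AR fit; returns coefficients, one-step predictions, targets."""
    x = np.asarray(x, dtype=float)
    # each row holds the `order` samples preceding the target, most recent first
    X = np.column_stack([x[order - k : len(x) - k] for k in range(1, order + 1)])
    y = x[order:]
    coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coeffs, X @ coeffs, y

def fit_metrics(y_true, y_pred):
    """Mean squared error, mean absolute error, coefficient of determination."""
    err = y_true - y_pred
    mse = np.mean(err ** 2)
    mae = np.mean(np.abs(err))
    r2 = 1.0 - np.sum(err ** 2) / np.sum((y_true - y_true.mean()) ** 2)
    return mse, mae, r2

t = np.arange(1000)
signal = np.sin(2 * np.pi * t / 50)       # synthetic test signal
coeffs, pred, target = fit_ar(signal, order=10)
mse, mae, r2 = fit_metrics(target, pred)
```

Note that these metrics describe one-step-ahead prediction from true past samples, which is a less demanding task than free-running simulation of the signal.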
AR modeling has been widely used in plant electrophysiology to approximate physiological patterns, especially under abiotic stress or toxic exposure, due to its interpretability and capacity to quantify temporal dependencies [
46,
47,
48,
49]. Studies have shown that AR models not only reduce computational complexity but also retain physiologically meaningful coefficients that can be linked to the presence of environmental contaminants [
50].
This parametric representation provided valuable features that were later used as input for classification algorithms. Furthermore, the AR model serves as a low-complexity, interpretable tool for real-time biosignal synthesis, anomaly detection, and adaptive filtering in intelligent biosensing systems [
51].
3.8. Classification Algorithm Performance and Confusion Matrix Analysis
Table 5 summarizes the evaluation metrics for the three classifiers. XGBoost obtained the best overall results, with both precision and recall at 0.94, showing reliable detection of both Pb-contaminated and control signals. SVM reached a precision of 0.90 but a lower recall of 0.82, indicating missed contaminated cases. Random Forest achieved balanced values (precision and recall at 0.91), slightly below XGBoost in accuracy.
To statistically compare model performances, 95% confidence intervals were computed for all evaluation metrics using 1000 bootstrap resamples. Pairwise comparisons between classifiers were performed using the Wilcoxon signed-rank test on the cross-validation fold scores. Additionally, permutation tests () were conducted to assess the probability of obtaining the observed accuracies under a null hypothesis of label exchangeability.
The results indicated that XGBoost significantly outperformed both SVM and Random Forest in accuracy (
p = 0.011 vs. RF,
p = 0.007 vs. SVM), precision (
p = 0.018 vs. RF,
p = 0.013 vs. SVM), and recall (
p = 0.017 vs. RF,
p = 0.010 vs. SVM). All permutation test
p-values were below 0.01, confirming that the observed differences in performance are highly unlikely to be due to random chance.
Table 6 summarizes the statistical comparisons.
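The resampling procedures can be illustrated with the sketch below. It makes several simplifying assumptions: the labels and predictions are hypothetical, the permutation count of 1000 is chosen here for illustration, and labels are shuffled against fixed predictions instead of retraining the classifier on each permutation, which a full permutation test would do.

```python
import numpy as np

def bootstrap_ci(y_true, y_pred, n_boot=1000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for accuracy."""
    rng = np.random.default_rng(seed)
    n = len(y_true)
    scores = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)          # resample cases with replacement
        scores[b] = np.mean(y_true[idx] == y_pred[idx])
    tail = (1.0 - level) / 2.0
    return np.quantile(scores, tail), np.quantile(scores, 1.0 - tail)

def permutation_p_value(y_true, y_pred, n_perm=1000, seed=0):
    """Share of label shuffles whose accuracy matches or beats the observed one."""
    rng = np.random.default_rng(seed)
    observed = np.mean(y_true == y_pred)
    hits = sum(np.mean(rng.permutation(y_true) == y_pred) >= observed
               for _ in range(n_perm))
    return (hits + 1) / (n_perm + 1)

# Hypothetical labels: 160 control (0) and 160 contaminated (1) signals
y_true = np.repeat([0, 1], 160)
y_pred = y_true.copy()
y_pred[::16] = 1 - y_pred[::16]       # flip 20 of 320 predictions
lo, hi = bootstrap_ci(y_true, y_pred)
p = permutation_p_value(y_true, y_pred)
```

With 1000 permutations the smallest attainable p-value is 1/1001, so a result "below 0.01" is the strongest statement this setting can support.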
These results confirm that XGBoost was the most accurate and balanced model, making it the best option for classifying bioelectrical signals in environmental contamination studies. This aligns with previous studies reporting that XGBoost performs well on small datasets with high-dimensional features [
37,
55,
56].
To assess classification performance, confusion matrices were generated for each model. As shown in
Figure 12, all classifiers performed well, with XGBoost showing the best performance.
Based on the confusion matrices shown in
Figure 12, additional metrics were calculated to provide a more comprehensive evaluation of classifier performance. In particular, sensitivity (recall) and specificity were assessed to quantify the ability of each model to correctly identify both lead-contaminated and uncontaminated signals. XGBoost achieved the highest sensitivity (0.94) and specificity (0.95), confirming its balanced capability to minimize both false negatives and false positives. Random Forest followed closely with a sensitivity of 0.92 and specificity of 0.94, maintaining strong overall performance. In contrast, SVM exhibited lower sensitivity (0.83) but a specificity equal to that of XGBoost (0.95), indicating a stronger tendency to correctly classify uncontaminated samples while missing a greater proportion of contaminated signals.
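For reference, sensitivity and specificity follow directly from the four cells of a binary confusion matrix. The counts in the sketch below are hypothetical and are not taken from Figure 12.

```python
def sensitivity_specificity(tp, fn, tn, fp):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    sensitivity = tp / (tp + fn)      # share of contaminated signals detected
    specificity = tn / (tn + fp)      # share of control signals correctly cleared
    return sensitivity, specificity

# Hypothetical counts for 160 contaminated and 160 control signals
sens, spec = sensitivity_specificity(tp=150, fn=10, tn=152, fp=8)
```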
These results further support the conclusion that XGBoost achieves the most favorable balance between detecting lead-contaminated signals and correctly identifying uncontaminated signals, an essential requirement in environmental monitoring, where both false negatives and false positives must be minimized. The dataset analyzed in this study comprised 320 labeled instances, obtained from two measurements per plant across 160 plants (80 plants and 160 samples per group: control and lead-contaminated).
Subsequently, to mitigate overfitting, a stratified five-fold cross-validation scheme was combined with 1000 bootstrap resamples to derive mean performance metrics and their 95% confidence intervals. In addition, permutation tests (
) were conducted for each classifier to determine the likelihood of achieving the observed classification accuracies under the null hypothesis of label exchangeability. All statistical analyses confirmed that the classification performance was significantly higher than random chance, indicating that the obtained results are robust and unlikely to be explained by statistical artifacts. The bootstrap-derived mean values and 95% confidence intervals for accuracy, precision, recall, and F1-score of the XGBoost model are summarized in
Table 7.
The implementation of three distinct classification algorithms with complementary strengths, combined with systematic hyperparameter tuning via grid search and cross-validation, reduced the likelihood that the observed performance was influenced by the bias of a single model. This methodological approach provides greater confidence that the extracted bioelectrical signal features are genuinely discriminative of lead exposure, rather than artifacts generated by a specific classification technique.
The bootstrap-derived metrics confirm that the XGBoost model is stable, with narrow confidence intervals across all evaluation measures. This indicates that the classifier sustains high performance even under resampling variability, supporting its generalization capability despite the moderate dataset size. However, while these quantitative results show that XGBoost can reliably distinguish between Pb-contaminated and control signals, they do not reveal the basis of its decisions. To address this, SHAP (SHapley Additive exPlanations) analysis was applied to identify the features that most influenced the model’s predictions [
23]. This approach links statistical performance with the physiological signal characteristics relevant for classification.
Figure 13 and
Figure 14 display the global average impact and instance-wise contributions of each feature, respectively.
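To make the idea behind these attributions concrete, the sketch below computes exact Shapley values by brute force for a toy model. It does not use the SHAP library or the trained XGBoost classifier: the three-feature linear scorer, the input, and the baseline are hypothetical, and features absent from a coalition are simply replaced by their baseline value.

```python
from itertools import combinations
from math import factorial

def shapley_values(model, x, baseline):
    """Exact Shapley attributions; absent features take their baseline value."""
    n = len(x)

    def value(subset):
        z = [x[i] if i in subset else baseline[i] for i in range(n)]
        return model(z)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # weight of a coalition of this size in the Shapley average
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi

# Hypothetical three-feature linear scorer standing in for the classifier
weights = [2.0, -1.0, 0.5]
model = lambda z: sum(w * v for w, v in zip(weights, z))
phi = shapley_values(model, x=[1.0, 3.0, 2.0], baseline=[0.0, 1.0, 2.0])
```

The attributions sum to the difference between the model output at the input and at the baseline. Averaging absolute attributions over many instances gives a global ranking of the kind shown in a summary plot.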
The SHAP analysis for both Class 1 (
polluted) and Class 0 (
unpolluted) predictions consistently identified wavelet-based entropy, energy features, and autoregressive (AR) residuals as the main contributors to the model output. These findings support the physiological relevance of the extracted descriptors and align with previous biosensing studies where entropy and wavelet coefficients were among the most predictive features [
57].
In particular, mid-level wavelet coefficients (levels 3 and 4) and entropy measures were the most influential features in the SHAP rankings for both classes. Physiologically, these coefficients reflect signal variations over intermediate time windows, linked to ionic fluxes across cell membranes and turgor pressure dynamics. Under Pb stress, disruptions in calcium and potassium homeostasis alter plant electrical potentials, which appear as changes in energy within these frequency bands [
6,
25,
26].
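Level-wise wavelet energies and a wavelet entropy of the kind discussed here can be sketched as follows. The Haar wavelet, the four-level depth, and the Shannon entropy over relative band energies are assumptions made for illustration, since the mother wavelet and the exact entropy definition are not restated in this section. The input length must be divisible by 2 raised to the number of levels.

```python
import numpy as np

def haar_dwt_energies(x, levels=4):
    """Detail-band energies for levels 1..levels and final approximation energy."""
    a = np.asarray(x, dtype=float)
    energies = []
    for _ in range(levels):
        even, odd = a[0::2], a[1::2]
        d = (even - odd) / np.sqrt(2.0)    # detail coefficients at this level
        a = (even + odd) / np.sqrt(2.0)    # approximation passed to next level
        energies.append(np.sum(d ** 2))
    return np.array(energies), np.sum(a ** 2)

def wavelet_entropy(energies):
    """Shannon entropy of the relative energy distribution across bands."""
    p = energies / energies.sum()
    p = p[p > 0]
    return -np.sum(p * np.log(p))

rng = np.random.default_rng(3)
x = rng.standard_normal(1024)              # synthetic stand-in signal
energies, approx_energy = haar_dwt_energies(x, levels=4)
entropy = wavelet_entropy(energies)
```

In this scheme, levels 3 and 4 correspond to progressively coarser time scales, which is why changes in their energy can reflect slower physiological processes.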
Entropy measures quantify the unpredictability or complexity of the bioelectrical signal. Higher entropy may indicate reduced coordination of electrical activity, potentially linked to membrane damage, altered stomatal function, or impaired phloem transport under heavy metal exposure [
58]. The high SHAP importance of entropy features therefore suggests that Pb contamination alters the electrophysiological organization of
Aloe vera.
The predominance of mid-level wavelet coefficients and entropy measures for both Class 1 and Class 0 predictions is consistent with known physiological responses to heavy metal stress. Intermediate frequency components are associated with active transport and vascular signaling in the phloem and xylem, processes that can be disrupted by Pb-induced oxidative stress and ion imbalance. The agreement between SHAP-derived features and established stress physiology supports the biological relevance of the model’s predictions.
Following the SHAP-based interpretation, an additional analysis evaluated whether measurement order influenced the bioelectrical signal patterns. This step aimed to rule out temporal drift or environmental changes during acquisition that might obscure Pb-induced alterations.
To this end, a correlation analysis was performed between the chronological measurement index and key amplitude-related features. Spearman’s rank correlation coefficient (ρ) was used because it is non-parametric and does not assume normally distributed data. As shown in
Table 8, no significant associations were found (
) for mean amplitude, peak-to-peak value, or wavelet level 3 energy. These results indicate that measurement order did not measurably affect the recordings.
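The rank correlation used in this check can be sketched as the Pearson correlation of ranks. The version below assumes there are no tied values (ties would require average ranks), and the measurement index and feature values are synthetic placeholders.

```python
import numpy as np

def spearman_rho(a, b):
    """Spearman's rho as the Pearson correlation of ranks (assumes no ties)."""
    rank_a = np.argsort(np.argsort(a))     # rank of each element of a
    rank_b = np.argsort(np.argsort(b))     # rank of each element of b
    return np.corrcoef(rank_a, rank_b)[0, 1]

rng = np.random.default_rng(4)
order = np.arange(50)                      # hypothetical measurement index
feature = rng.standard_normal(50)          # hypothetical feature values
rho = spearman_rho(order, feature)
```

A coefficient near zero means the feature shows no monotonic trend with acquisition order, whereas a drifting setup would push the coefficient toward +1 or -1.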
The absence of significant correlations supports the robustness of the experimental protocol against temporal bias. However, this analysis cannot replace longitudinal monitoring. Future studies should include repeated measurements across multiple time points for each plant, combined with repeated-measures ANOVA, to better capture intra-plant variability and temporal effects.
This study shows that Pb exposure alters the bioelectrical signals of Aloe vera, but other factors may also affect these signals. Inter-plant variability, electrode placement, and soil moisture can influence amplitude and frequency independently of Pb. To reduce these effects, plants of similar age and size were used, maintained under controlled conditions, with standardized electrode placement and monitored soil moisture. These measures reduced variability but did not eliminate all confounders. Future work should use larger samples, repeated measurements, and explicit modeling of these factors to assess their impact relative to Pb exposure.
Overall, results from classification metrics, SHAP analysis, and correlation tests support the reliability of the proposed approach for detecting Pb stress in Aloe vera. XGBoost showed the best performance, with mid-level wavelet features and entropy measures consistently emerging as key predictors of Pb-related changes. No significant correlations with measurement order were found, and strict controls reduced the effect of confounding factors. Some limitations remain, including the moderate sample size and natural biological variability. Still, the agreement between statistical, computational, and physiological evidence provides a solid basis for applying this method in environmental biosensing. Future work should use larger datasets, repeated measurements, and validation across species to improve generalization.